Load the major data set

Data for extreme weather events and casualties from extreme weather was obtained from the National Climatic Data Centre archive (each one set). A total set comprises of data for each year for the period from 1950-2017 for both, extreme weather events and casualties. A Python script was used to scrape all data sets. First of all, all data sets (extreme weather events and casualties from extreme weather) will be loaded and combined into a master data frame (df_events_victims), which will be used for all downstream analyses. For more information, please refer to the readme file (readme.md).

# Load all csv files and combine victims_data with storm_events via EVENT_ID
combine_data = function(range, path_events = "storm_data/events_data/", path_victims = "storm_data/victims_data/"){
  combined_events = data.frame()
  combined_victims = data.frame()
  for(year in range){
    #print(year)
    for(csv_file in list.files(path_events)) {
      regexpr = regexec(as.character(year), csv_file)     # regex expression, if you want to load data only for a particular period (start year-end year). requires zip files and csv files to be in seperated folders. Otherwise regex will match both and the final dataframe will contain duplicated data.
      match = length(regmatches(csv_file,regexpr)[[1]])
     
      if(match){
        file_in_events = read.csv(paste(path_events,csv_file,sep=""), colClasses = c(rep(NA,48),rep("NULL",3)))  # exclude the last 3 columns (episode_narrative, event_narrative,data_source)
        colnames(file_in_events)[12]="MONTH" # change column name MONTH_NAME to MONTH
        file_in_events$MONTH = month(dmy_hms(file_in_events$BEGIN_DATE_TIME, locale = "en_US.UTF-8")) # change month string representation to integer in MONTH 
        file_in_victims = read.csv(paste(path_victims,gsub("Storm_Event","Storm_Fatalities",csv_file),sep=""), colClasses = c(rep("NULL",4),NA,rep("NULL",2), rep(NA,3),"NULL"))  # include only fatality_age and fatality_gender columns
        if(is.integer(file_in_events$DAMAGE_CROPS)){
          file_in_events$DAMAGE_CROPS = as.character(file_in_events$DAMAGE_CROPS) # bind_rows tries to convert to factor if coltype is integer, which throws an error!! convert to chr
        }
        combined_events = bind_rows(combined_events, file_in_events)
        combined_victims = bind_rows(combined_victims, file_in_victims)
      }
    }
  }
  
  return(left_join(combined_events, combined_victims, by = "EVENT_ID"))
}

# providing range in years in the combine_data() function defines which years to include in the master data frame
df_events_victims = combine_data(1950:2017)

# optional, write new csv file
# write.csv(df_events_victims, file = "storm_data/all_storm_events_victims.csv", row.names=FALSE)

Introduction

There is compelling evidence that supports the existence of climate change (e.g. https://climate.nasa.gov/evidence/). However, it is very challenging to determine the long-term social and economic consequences for the global community. Especially, because the transformation of climate change into socioeconomic effects takes place gradually over the period of years, decades and centuries, it becomes quite abstract and difficult to grasp for the society. Especially, because leading industrial countries can yet cope with these consequences by means of technological advancement and economic power. But mostly, this is due to the earth’s capability of resource regeneration in combination with existing resource capacities. However, the earth’s buffering effect is only limiting if we don’t change the lifestyle of western industrial nations (Moreover, be aware that less developed countries, which are prevalent in terms of quantity, seek to reach our economic status). As such, I strongly believe that the society (including myself!!) is given a false sense of security and tends to be ignorant about the true consequences for the all the generations to come after us!! Nevertheless, I also strongly believe that we can make the global transition from non-sustainable to sustainable society.

So I am very motivated to explore the data provided and translate relevant information in an insightful way in order to make climate change and cognate consequences less abstract and, thus, contributing to raise the awareness for this very important topic that affects each one of us!! More precisely, I am using time series data on global temperatures, carbon dioxide concentration and extreme weather events aiming at analysing whether such extreme conditions became more frequent over the years and if there is a correlation with global warming. The analysis will be focusing on the United States.

Number of different extreme weather types in the dataset

Let’s visualize the unique categories of storm events present in the dataset, and plot the total number of occurrences from 1950-2017 for each category.

options(scipen=999) # scipen is used to suppress scientific notation in R (e+ notation for numbers); for more information check R help

ggplot(aes(x=EVENT_TYPE, fill = "red"), data = df_events_victims) + geom_bar() +
  guides(fill=guide_legend(ncol=8)) +
  scale_y_sqrt(breaks=c(0,50000,100000,200000,300000,400000)) +
  xlab("Type of Extreme Weather")+
  ylab("Number of Events") +
  geom_hline(yintercept = 50000, linetype = 2, alpha = 0.3) +
  ggtitle("Total Number of Extreme Weather Events by default Categories")+
  guides(fill=F) +
  theme_minimal() +
  theme(plot.title= element_text(hjust=0.5, size = 14, face = "bold"), axis.text.x = element_text(angle = 45, hjust = 1, size = 10), 
        axis.text.y = element_text(size = 10), axis.title =element_text(size=12)) 

Hail and Thunderstorm Wind are the most frequent categories over the time period 1950-2017. In general, “less dramatic” manifestations of extreme weather that belong to categories related to “Storm”, “Flood,”Winter Weather" are overall the most frequent. This is obviously to be expected, since some events, like Tsunamis or Volcanic Eruptions, are simply less likely to be triggered as opposed to the aforementioned events. The very high total number of Tornado occurrences (hline in the above plot depicts count = 50000) really surprised me (probably, because Tornados are extremely rare in Europe). In contrast, I am surprised that no earthquakes were documented. It is evident that there is some redundancy at the level of event type. E.g., there are multiple subcategories for Thunderstorms, Floods, Hail etc.. For further analysis with respect to storm event types I will try to reduce redundancy by grouping events. This will ensure a better overview in all subsequent analysis that include weather events.

Reducing redundancy in storm events

Before deciding on the grouping pattern, I have to decide whether some events should be excluded from the data set. In particular, I am only interested to analyse the dynamics of extreme weather events that, according to present knowledge, can be correlated with negative effects on society, nature and economy. However, there is even a documented correlation between high Northern Lights activity and the collapse of the Quebec power grid (http://www.bbc.com/future/story/20130410-the-greatest-show-on-earth). This is also true for Tides that are described to have far-reaching consequences for ecosystems and society (https://oceanservice.noaa.gov/education/pd/tidescurrents/tidescurrents_effects_influences.html). As such, I won’t exclude any event type.

Weather categories proposed by the NSSL (http://www.nssl.noaa.gov/education/svrwx101/) were extended by the groups “water column” and “other” to define following major groups:

  1. Storm
  2. Tornados
  3. Floods
  4. Hail
  5. Damaging Winds
  6. Winter Weather
  7. Tide
  8. Other

Semantics of recorded event types were considered to assign each event to the appropriate group. E.g., all names containing the word “Thunderstorm” were combined into the group “storm”, or any event having characteristics of a storm, such as “Hurricane” and “Lightning”. Of course, the underlying classification is to some extent arbitrary and other classifications are also possible. Single events that have no obvious association to the first 7 categories were grouped into the “other” group.

category_storm = c("Thunderstorm Wind", "THUNDERSTORM WINDS/FLOODING", "THUNDERSTORM WINDS/FLASH FLOOD", "THUNDERSTORM WINDS LIGHTNING", "THUNDERSTORM WIND/ TREES", "THUNDERSTORM WIND/ TREE", "THUNDERSTORM WINDS FUNNEL CLOU", "THUNDERSTORM WINDS/HEAVY RAIN", "THUNDERSTORM WINDS HEAVY RAIN", "THUNDERSTORM WINDS/ FLOOD", "Lightning", "Tropical Storm", "Hurricane (Typhoon)", "Heavy Rain", "Hurricane (Typhoon)","Marine Thunderstorm Wind", "Tropical Depression", "Hurricane", "Marine Tropical Storm", "Marine Hurricane/Typhoon", "Marine Lightning",  "Marine Tropical Depression")
category_flood = c("Flash Flood", "Flood", "Coastal Flood", "Storm Surge/Tide","Tsunami", "Lakeshore Flood")
category_hail = c("Hail", "HAIL/ICY ROADS","HAIL FLOODING", "Marine Hail")
category_tornado =c("Tornado", "TORNADOES, TSTM WIND, HAIL", "TORNADO/WATERSPOUT", "Funnel Cloud", "Waterspout", "Dust Devil")
category_wind = c("High Wind", "Strong Wind", "Marine High Wind", "Heavy Wind", "Marine Strong Wind")
category_winter = c("Winter Storm", "Blizzard", "Cold/Wind Chill", "Heavy Snow", "Ice Storm", "Winter Weather", "Avalanche", "Frost/Freeze", "Freezing Fog", "Sleet", "Extreme Cold/Wind Chill", "High Snow")
category_drought = c("Heat", "Wildfire", "Dust Storm", "Drought", "Excessive Heat")
category_tide = c("High Surf", "Rip Current", "Astronomical Low Tide", "Seiche", "Sneakerwave")
category_other = c("Debris Flow", "Lake-Effect Snow", "Volcanic Ash", "OTHER", "Northern Lights", "Dense Smoke", "Landslide", "Volcanic Ashfall", "Marine Dense Fog", "Dense Fog")

# hash the categories (similar to dictionary object in python); Allows easy and fast categorization of sub-categories (e.g., "Heat") into corresponding main category (e.g., "Drought")
major_categories = hash(Storm=category_storm, Flood=category_flood, Hail=category_hail, Tornado=category_tornado, Wind=category_wind, Winter=category_winter ,Drought=category_drought, Tide=category_tide, Other=category_other)

# retrieve the main category for each extreme weather sub-category
classification = function(event){
  output = NULL
  for(key in keys(major_categories)){
    if (event %in% major_categories[[key]]){
      output = key
    }
    else if(is.null(output)){
      output = NA
    }
  }
  output
}

# apply new categorization of extreme weather events to the main data set
df_events_victims["EVENT_CATEGORY"] = apply(subset(df_events_victims,select = c("EVENT_TYPE")), 1, function(x) classification(x))
p1 = ggplot(aes(x=EVENT_CATEGORY, fill = "red"), data = df_events_victims) + geom_bar() +
  theme(legend.position="bottom", axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill=guide_legend(ncol=8)) +
  scale_y_sqrt() +
  xlab("Category") +
  ylab("Number of Events") +
  ggtitle("Total Number of Extreme Weather Events\nby new Categorization") +
  guides(fill=F) +
  theme_minimal() +
  theme(plot.title= element_text(hjust=0.5, size = 14, face = "bold"), axis.text.x = element_text(angle = 45, hjust = 1,size = 10),  
        axis.text.y = element_text(size = 10), axis.title =element_text(size=12))

# load "RdYlBu" color palette with 9 distinct colors (for each category)
getPalette = colorRampPalette(brewer.pal(9,"RdYlBu"))

p2 = ggplot(aes(x=MONTH), data = df_events_victims) + geom_bar(aes(fill = EVENT_CATEGORY)) +
  xlab("Month") +
  ylab("Number of Events") +
  ggtitle("Seasonal Distribution for Extreme Weather Categories")+
  theme_minimal() +
  scale_x_continuous(limits = c(0,13), breaks = seq(1,12,1), labels = c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")) +
  scale_fill_manual(values=getPalette(9), guide = guide_legend(title = "Category")) +
  theme(plot.title= element_text(hjust=0.5, size = 14, face = "bold"), axis.text.x = element_text(angle = 45, hjust = 1, size = 10),  
        axis.text.y = element_text(size = 10), axis.title =element_text(size=12), legend.text=element_text(size=10), legend.title=element_text(size=12))


grid.arrange(p1,p2,ncol=2, widths=c(5,6))

After merging similar events into major categories, we can see that events related to Hail, Storm and Winter Weather are by far the most frequent categories over the period 1950-2017, followed by events related to Flood, Drought, Tornado and Wind. The least frequent events belong to major categories Tide and Other. Let’s analyse the seasonal distribution for each event category. As we can see, different types of extreme weather have a specific seasonal distribution. E.g., the major Storm season is between May and August, whereas Hail is predominantly occurring between April and July and as expected, Winter category events mostly occur in winter months December, January and February.

Casualties caused by extreme weather

The data set provides information about injuries and death toll caused by extreme weather events. I want to analyse the distribution of such incidents (average casualties etc.) and how many such incidents have been recorded in total.

# combines direct and indirect casualties for each type (injuries vs. fatalities) 
casualties_reshape = subset(df_events_victims, select = c("STATE", "STATE_FIPS","INJURIES_DIRECT", "INJURIES_INDIRECT", "DEATHS_DIRECT", "DEATHS_INDIRECT", "FATALITY_AGE", "FATALITY_SEX")) %>%
  # state fips have two-digit format; formatC() preserves 2-digit format and adds a 0 if entry is of 1-digit type
  transform(FIPS = formatC(STATE_FIPS, digits=2,width=2,format = "d", flag=0)  ) %>%
  # converts NA to 0 for one casualty category (here, indirect vs. direct) if other category is not NA; This avoids NA result when summing number and NA 
  mutate_at(vars(DEATHS_DIRECT), funs(ifelse(!is.na(DEATHS_INDIRECT) & (is.na(DEATHS_DIRECT) | DEATHS_DIRECT==0),0, DEATHS_DIRECT ))) %>%
  mutate_at(vars(DEATHS_INDIRECT), funs(ifelse(!is.na(DEATHS_DIRECT) & (is.na(DEATHS_INDIRECT) | DEATHS_INDIRECT==0),0, DEATHS_INDIRECT ))) %>%
  mutate_at(vars(INJURIES_DIRECT), funs(ifelse(!is.na(INJURIES_INDIRECT) & (is.na(INJURIES_DIRECT) | INJURIES_DIRECT==0),0, INJURIES_DIRECT ))) %>%
  mutate_at(vars(INJURIES_INDIRECT), funs(ifelse(!is.na(INJURIES_DIRECT) & (is.na(INJURIES_INDIRECT) | INJURIES_INDIRECT==0),0, INJURIES_INDIRECT ))) %>%
  transform(Fatalities = DEATHS_DIRECT + DEATHS_INDIRECT, Injuries = INJURIES_DIRECT + INJURIES_INDIRECT)
 
# total counts for each category (injuries vs. fatalities)
casualties_total = casualties_reshape %>% 
  select(c("Fatalities", "Injuries")) %>%
  gather(key = "casualty", value = "count",1:2) %>%
  group_by(casualty) %>%
  summarise(total_counts = sum(count))

# number of casualties for each extreme weather event (used to plot the distribution of casualties)
casualties_by_category = casualties_reshape %>%
  select(c("Fatalities", "Injuries")) %>%
  gather(key = "casualty", value = "count",1:2)

# number of casualties by age and gender  
casualties_by_age_gender = subset(casualties_reshape, !is.na(casualties_reshape$FATALITY_AGE))  %>%
  select(c("FATALITY_SEX","FATALITY_AGE", "Fatalities", "Injuries")) %>%
  filter(FATALITY_SEX != "") %>%
  gather(key = "casualty", value = "count", Fatalities:Injuries) %>%
  group_by(FATALITY_AGE,FATALITY_SEX, casualty) %>%
  summarise(counts=sum(count)) %>%
  ungroup() %>%
  ungroup() 
p3=ggplot(aes(x = casualty, y=total_counts), data = casualties_total) + geom_col(aes(fill = casualty)) +
  scale_fill_manual(values=getPalette(2), guide = guide_legend(title = "Casualty Type"))+
  xlab("Casualty Type")+
  ylab("Number of Total Casualties")+
  scale_x_discrete(labels=c("Fatalities","Injuries")) +
  scale_y_sqrt(lim=c(0,600000),breaks=c(1000, 10000, 100000, 500000)) +
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title =element_text(size=12), legend.text=element_text(size=10), legend.title=element_text(size=12))+
  guides(fill=F)

# +1 transformation of x-values to cope with log10 transformation of x-axis (this will include x-values = 0) 
p4=ggplot(aes(x = count+1), data = casualties_by_category) + 
  geom_histogram(aes(fill=casualty),binwidth=0.04) +
  # breaks are each shifted +1 because of +1 transformation of x-values, however, labels are -1 rescaled to represent the correct values on the x-axis
  scale_x_log10(lim=c(0.7,10000),breaks=c(1,2,11,71,162,1001,10001), labels=c(0,1,10,70,161,1000,10000)) +
  xlab("Number of Casualties") +
  ylab("Counts") +
  scale_y_log10(breaks=c(10,100, 1000, 10000, 100000, 1000000)) +
  scale_fill_manual(values=getPalette(2), guide = guide_legend(title = "Casualty Type")) +
  scale_color_manual(name = "Events with casualties", values=c("95th percentile"="black", "5th percentile"="black")) +
  scale_linetype_manual(name = "Events with casualties", values = c("95th percentile"=1, "5th percentile"=2)) +
  theme_minimal() +
  theme(strip.text = element_text(size=12), axis.text = element_text(size = 10), axis.title =element_text(size=12), legend.text=element_text(size=10), legend.title=element_text(size=12))+
  facet_wrap(~casualty)
  

# add lines depicting 5th and 95th percentile for events with recorded casualties (excludes events with no casualty records)
p4 = p4 + geom_vline(data =subset(casualties_by_category, casualty=="Fatalities" & count !=0) ,aes(color = "95th percentile", xintercept = quantile(count,probs=0.95), linetype = "95th percentile" )) + 
  geom_vline(data = subset(casualties_by_category, casualty=="Fatalities" & count !=0) ,aes(color = "5th percentile", xintercept = quantile(count,probs=0.5), linetype = "5th percentile" ))

p4 = p4 + geom_vline(data =subset(casualties_by_category, casualty=="Injuries" & count !=0) ,aes(color = "95th percentile", xintercept = quantile(count,probs=0.95), linetype = "95th percentile" ) ) + 
  geom_vline(data = subset(casualties_by_category, casualty=="Injuries" & count !=0) ,aes(color = "5th percentile", xintercept = quantile(count,probs=0.5), linetype = "5th percentile" ) ) 

grid.arrange(p4,p3,widths=c(22,6), ncol=2, top=textGrob("Distribution of Casualty Counts",  gp=gpar(fontsize=14, face = "bold")))

The distribution of casualties is highly right skewed for both fatalities and injuries. For more than 1000000 of the total 1442330 extreme weather events no casualties were recorded. For weather events with recorded casualties, the number of fatalities ranges from 1-161 whereas the number of injuries ranges from 1-70 (range represents 90% of data with recorded casualties). Moreover, both categories are associated with a smaller number of events that have significantly higher casualty numbers compared to the majority of events in the cognate group. E.g., events with approximately 800 casualties and almost 5000 casualties in the Fatalities and Injuries group, respectively. With predominantly 2-100 incidents per event, the results on the distribution of casualty numbers are somewhat deceiving in terms of perception of total casualty numbers. However, if we depict the sum of all casualties, the full scope of the consequences associated with extreme weather events becomes much clearer. Over 500000 injuries and fatalities each were recorded for the period 1950-2017!!

Now let’s include demographic statistics and analyse casualties by age and gender. For different age groups, I would expect to observe casualty numbers that follow the general population distribution for the United States.

ggplot(aes(x=FATALITY_AGE, y =counts, group=FATALITY_SEX), data = casualties_by_age_gender) + 
  ggtitle("Casualties by Age and Gender") +
  geom_line(aes(color=FATALITY_SEX),stat = "summary", fun.y = sum) +
  scale_y_sqrt() +
  scale_x_continuous(lim=c(0,100), breaks=seq(0,100,5))+
  xlab("Age (Years)")+
  ylab("Number of Casualties")+
  geom_smooth(aes(color=FATALITY_SEX)) +
  scale_color_manual(values=(getPalette(2)), guide = guide_legend(title = "Gender"), labels=c("Female", "Male"))+
  facet_wrap(~casualty) +
  theme_minimal()+
  theme(strip.text = element_text(size=12), plot.title= element_text(hjust=0.5, size = 14, face = "bold"),  axis.text = element_text(size = 10), 
        axis.title =element_text(size=12),legend.text=element_text(size=10), legend.title=element_text(size=12)) 

# for each casualty type (injuries vs. fatalities), calculates the Pearson Correlation Coefficient for age up to 70 and 70 or older, respectively
cor_fat_0_70 = tidy(with(subset(casualties_by_age_gender, casualties_by_age_gender$FATALITY_AGE <= 70 & casualties_by_age_gender$casualty == "Fatalities"), cor.test(FATALITY_AGE,counts)))
cor_inj_0_70 = tidy(with(subset(casualties_by_age_gender, casualties_by_age_gender$FATALITY_AGE <= 70 & casualties_by_age_gender$casualty == "Injuries"), cor.test(FATALITY_AGE,counts)))
cor_fat_70_ = tidy(with(subset(casualties_by_age_gender, casualties_by_age_gender$FATALITY_AGE > 70 & casualties_by_age_gender$casualty == "Fatalities"), cor.test(FATALITY_AGE,counts)))
cor_inj_70_ = tidy(with(subset(casualties_by_age_gender, casualties_by_age_gender$FATALITY_AGE > 70 & casualties_by_age_gender$casualty == "Injuries"), cor.test(FATALITY_AGE,counts)))
df_cor = data.frame() %>% bind_rows(cor_fat_0_70,cor_inj_0_70,cor_fat_70_,cor_inj_70_)
rownames(df_cor) = c("Fatalities age 0-70", "Injuries age 0-70", "Fatalities age 70+", "Injuries age 70+")
df_cor
##                       estimate statistic                           p.value
## Fatalities age 0-70  0.7323805 12.726831 0.0000000000000000000000003908692
## Injuries age 0-70    0.5085346  6.988120 0.0000000001039410108505342153734
## Fatalities age 70+  -0.6184963 -6.096878 0.0000000845980723155704863235678
## Injuries age 70+    -0.6286802 -6.262005 0.0000000446682687877821317499811
##                     parameter   conf.low  conf.high
## Fatalities age 0-70       140  0.6455322  0.8005295
## Injuries age 0-70         140  0.3752411  0.6212224
## Fatalities age 70+         60 -0.7520823 -0.4360966
## Injuries age 70+           60 -0.7592307 -0.4494934
##                                                   method alternative
## Fatalities age 0-70 Pearson's product-moment correlation   two.sided
## Injuries age 0-70   Pearson's product-moment correlation   two.sided
## Fatalities age 70+  Pearson's product-moment correlation   two.sided
## Injuries age 70+    Pearson's product-moment correlation   two.sided

As evident from figure Casualties by Age and Gender and the results from Pearson’s correlation test, the casualty numbers are positively correlated with age up to the age of approximately 70. From the age 0f 70 onward the correlation between casualty numbers and age is negative. This is true for both categories (Fatalities and Injuries), however, the positive correlation is stronger for Fatalities. Those trends could merely reflect age demographics for the United States or alternatively imply differences in risk for different age groups. So let’s include data on age distribution in the United States to clarify on the aforementioned statement.

# group casualties by age groups
casualties_by_age_gender["Age_groups"] = as.character(cut(casualties_by_age_gender$FATALITY_AGE, breaks=c(0,19, 39, 59,79, 89, max(casualties_by_age_gender$FATALITY_AGE))))

# read the census data (combined male and female data from 1990-2017) and calculate average population percentage for each age group
census_data = read.csv("census_usa.csv", check.names = F)
census_data_mean = apply(census_data[,-1],2,function(x)mean(x)) 

# convert vector namess from census_data_mean to same names used in "Age_groups" variable of casualties_by_age_gender dataset; This way, we can use the hashed form of census data to incorporate
# the population percentage for each corresponding Age group in casualties_by_age_gender dataset
names(census_data_mean)[order(names(census_data_mean))] = pull(unique(na.omit( casualties_by_age_gender["Age_groups"])))

# make a python dictionary-like object with age groups as keys and the corresponding average population percentage as values
census_data_mean_dict = hash()
for(age_group in names(census_data_mean)){
  census_data_mean_dict[age_group] = census_data_mean[age_group]
}

# function used to retrieve population percentage for each corresponding Age group
census_classification = function(group) {
  if(is.na(group)){
    output = NA
  }
  else {
    output = census_data_mean_dict[[group]]
  }
  output
}

# include the corresponding population percentage from census data for the cognate age group in the data set
casualties_by_age_gender["census"] = apply(subset(casualties_by_age_gender, select = c("Age_groups")),1, function(x) census_classification(x))
casualties_by_age_groups = subset(casualties_by_age_gender,!is.na(casualties_by_age_gender$Age_groups)) %>%
  group_by(Age_groups,census) %>%
  summarise(group_counts = sum(counts)) %>%
  transform(expected_counts = (census/100)*sum(group_counts) ) %>%
  ungroup() %>%
  select(c("Age_groups", "group_counts", "expected_counts"))

# reshape data into long format to plot observed counts and expected counts
casualties_by_age_groups_reshape = melt(casualties_by_age_groups)

# calculate the relative risk for each age group
# 1. take the bigger of expected values and observed values as numerator and calculate the ratio 
# 2. convert ratio to negative value if the expected value is bigger than the observed values
# 3. decreased risk will have ratio value < 1 and increased risk will have ratio value > 1 
casualties_by_age_groups["ratio"] = apply(casualties_by_age_groups[,2:3],1,function(x){max(x)/min(x)})
casualties_by_age_groups = casualties_by_age_groups %>%
  transform(ratio = ifelse(expected_counts > group_counts,-ratio,ratio ) ) 
p5=ggplot(aes(x=Age_groups, y=value, group=variable), data = casualties_by_age_groups_reshape) + geom_point(aes(shape = variable, fill = variable),size=6) +
  geom_line(aes(linetype=variable, size=variable)) +
  xlab("Age Groups") +
  ylab("Number of Casualties (Fatalities and Injuries)") +
  scale_x_discrete(labels=c("0-19","20-39","40-59","60-79","80-89","90+")) +
  scale_linetype_manual(values=c(1,2), guide = guide_legend(title = ""), labels=c("Observed Counts","Expected Counts")) +
  scale_size_manual(values=c(1, 1.5), guide=F)+
  scale_shape_manual(values=c(21,23), guide=F) +
  scale_fill_manual(values=(getPalette(2)), guide=F) +
  theme_minimal() +
  theme(axis.text = element_text(size = 10),  
        axis.title =element_text(size=12), legend.text=element_text(size=10))

p6=ggplot(aes(x=Age_groups, y=ratio), data = casualties_by_age_groups) + geom_col(aes(fill = ifelse(ratio < 0,"Decreased Risk","Increased Risk") )) +
  scale_y_continuous(lim=c(-12,12), breaks = seq(-12,12,4), labels=c("-12","-8","-4","1","4","8","12"))  +
  scale_x_discrete(labels=c("0-19","20-39","40-59","60-79","80-89","90+")) +
  xlab("Age Groups") +
  ylab("Relative Risk") +
  scale_fill_manual(values=c("darkgreen","red"), guide = guide_legend(title = ""))+
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title =element_text(size=12), legend.text=element_text(size=10), panel.grid.minor = element_blank())
   
grid.arrange(p5,p6,ncol=2, widths=c(6,5), top=textGrob("Relative Risk for Age-Groups",  gp=gpar(fontsize=14, face = "bold")))

If we take the average age distribution in the United States based on census data from 1990 to 2017 (http://www.populationpyramid.net/united-states-of-america/; 0-19 27.6%, 20-39 28.9%, 40-59 19%, 60-79 14.2%, 80-89 2.8%, 90+ 0.57%) as the reference and assume independence between age and risk of accidents, we would expect on average highest and lowest casualty numbers for age groups 0-19, 20-39, 40-59 and 80-89, 90+, respectively. Intermediate Casualty numbers should be expected for age group 60-79. However, this is not consistent with actual data, which shows much lower than expected casualty numbers for age groups 0-19 and 20-39 and much higher than expected casualty numbers for age groups 60-79, 80-89 and 90+ (Figure Relative Risk for Age-Groups left plot). Plotting the corresponding relative risk ratio (the larger value divided by the smaller value for the corresponding pair of observed counts and expected counts; negative sign indicates higher values for expected counts as opposed to observed counts) reveals that the former groups have decreased casualty risk, whereas the latter age groups have increased casualty risk. Only the age group 40-59 has casualty numbers that are close to the expected counts. Why?? (direct vs. Indirect casualties, older people more prone to die from injuries; here, one could analyse casualties separately for injuries and fatalities and furthermore analyse direct vs. indirect casualties)

Cost incurred by extreme weather

From the first glimpse at the storm events data set I noticed that values for DAMAGE_PROPERTY and DAMAGE_CROPS are encoded as digit-letter abbreviations, such as 10K, which equals 10000. For the purpose of summing the amounts over a month or a year, it is necessary to transform each value into the corresponding number. First, let’s check the unique set of characters used for abbreviation (e.g. “K” for ten thousand), so we know what numerical factors are required for conversion into numbers.

head(df_events_victims$DAMAGE_PROPERTY)
## [1] "250K" "25K"  "25K"  "2.5K" "2.5K" "2.5K"
# use regex to extract the character in abreviation and return unique 
# if required, transform factor type to character type for regex check
regexpr = regexec("[aA-zZ]", c(df_events_victims$DAMAGE_CROPS, df_events_victims$DAMAGE_PROPERTY))
unlist(unique(regmatches(c(df_events_victims$DAMAGE_CROPS, df_events_victims$DAMAGE_PROPERTY), regexpr)))
## [1] "K" "M" "k" "T" "B" "h" "H"

The unique set comprises of the letters K, M, T, H, B and a no match (character(0); omit unlist() to return no match). Since I used word character a-z for the regular expression, the no match means that the corresponding value is either missing or its representation excludes any character. Missing values and values without letter will be converted to NA, because we can’t deduce the corresponding real number value. From the documentation, we know that “K” and “M” are abbreviations for “thousand” and “million”, respectively. “B” is the canonical abbreviation for “billion”. However, the meaning for “H” and “T” is somewhat unclear. I assume that “T” refers to “trillion” (canonical letter abbreviation in finance), whereas “H” is equivalent for “hundred” (from hecto). But nevertheless, let’s extract the corresponding values containing T or h/H and use the number part of the string to help decide whether T and h/H really correspond to trillion and hundred, respectively. I expect the corresponding number to be in the range of tenth-thousandth (0.1-0.001) for Trillion, because otherwise the amount would be simply too big to be true. On the other hand, I expect the corresponding number for h/H values to be in the range of thousand and bigger (1000-), since I doubt that costs of 1000 dollars or lower, would have been recorded. Of course, my assumption might as well be wrong.

# adapt apply by changing select argument to "DAMAGE_CROPS" and regexec/regmatches pattern argument to "T" or "h"; returns boolean vector that can be used to return the numeric part of damage entries with appropriate character (here, "H") in abreviation
check = apply(subset(df_events_victims, select = "DAMAGE_PROPERTY"),1,function(x) {  ifelse(length(unlist(regmatches("H", regexec("H", x)))) ==1,TRUE,FALSE)  } )
unique(df_events_victims$DAMAGE_PROPERTY[check] )
## [1] "2H" "5H"

Since there is only one hit for “T” with 0 as the corresponding numeric value (“0T”), we don’t need to decide about the true meaning of “T”. However, the hits for “h”/“H” are of magnitude ten (“2h”,“5h”,“5H)”. As such, I revise my assumption for “h” and “H” to rather represent “hundred thousand” instead of “thousand”.

value_transformation = function(value){
  # there seems to be a bug with the hash() function. Using the standard definition of the hash object hash(key = value) with h as key returns an error (The empty character string, '', cannot be used for a key at key(s): 1). Somehow this is only true for h as key. This problem can be circumvented by using the .set() notation
  letter_value = hash()
  .set(letter_value, keys=c("K","k","M","B","T","h","H"), values=c(1000,1000,1000000,1000000000,1000000000000,100000,100000))
  # use regex to extract the letter from digit-letter abreviation
  regexpr_letter = regexec("[aA-zZ]", value)
  match_letter = unlist(regmatches(value, regexpr_letter))
   # use regex to extract the digit from digit-letter abreviation
  regexpr_digit = regexec("\\.?[0-9]+", value)
  match_digit = as.numeric(unlist(regmatches(value, regexpr_digit)))
  
  if(value == "0"){
    output = 0
  }
  # check for digit and letter match in corresponding damage entry; both required for valid transformation
  else if(length(match_letter) == 1 & length(match_digit) == 1){
    # convert to numeric value
    output = letter_value[[match_letter]]*match_digit
  }else{
    output = NA
  }
  output
}

# transformation of value_transformation() with Vectorize() is required when using dplyr's mutate_at, which takes vectors as input
value_transformation = Vectorize(value_transformation, USE.NAMES = F)
df_events_victims = df_events_victims %>% mutate_at(vars(DAMAGE_PROPERTY,DAMAGE_CROPS),funs(value_transformation))

# convert dataset including only the damage variables to long format (used for plotting the distribution of damage amount)
damage_reshape = gather(subset(df_events_victims, select = c("DAMAGE_PROPERTY", "DAMAGE_CROPS")), key = "damage_category", value = "amount", c("DAMAGE_PROPERTY", "DAMAGE_CROPS"))
# calculates the total damage amount 
total_damage_by_category = subset(damage_reshape, !is.na(damage_reshape$amount)) %>% group_by(damage_category) %>% summarise(total = sum(amount))
# +1 transformation of x-values to cope with log10 transformation of x-axis (this will include x-values = 0 )
p7 = ggplot(aes(x=amount+1), data = subset(damage_reshape, !is.na(amount))) + geom_histogram(aes(fill = damage_category),binwidth = 0.06) +
  xlab("Total Amount ($)") +
  ylab("Counts") +
  theme_minimal() +
  # breaks are each shifted +1 because of +1 transformation of x-values, however, labels are -1 rescaled to represent the correct values on the x-axis
  scale_x_log10(breaks = c(1, 101, 1001, 10001, 100001, 1000001, 3622641, 10000001, 100000001), labels=c(0, 100, 1000, 10000, 100000, 1000000, 3622640, 10000000, 100000000)) +
  scale_y_log10() +
  scale_fill_manual(name = "Damage Category", values=getPalette(2))+
  scale_color_manual(name = "statistics", values=c("mean"="black", "median"="black")) +
  scale_linetype_manual(name = "statistics", values = c("mean"=1, "median"=2)) +
  theme(strip.text = element_text(size=12), axis.text.x = element_text(angle = 45, hjust = 1,size = 10), axis.text.y = element_text(size = 10),  
        axis.title =element_text(size=12), legend.text=element_text(size=10), legend.title=element_text(size=12))+
  facet_wrap(~damage_category) 

# add lines representing mean and median damage amounts (+1 transformation of median, because median = 0; otherwise the line would not intersect 0 in the plot)
p7 = p7 + geom_vline(data =damage_reshape[which(damage_reshape$damage_category == "DAMAGE_PROPERTY"),] ,aes(color = "mean", xintercept = mean(damage_reshape$amount[!is.na(damage_reshape$amount) &  damage_reshape$damage_category == "DAMAGE_PROPERTY"]), linetype = "mean" ) ) + 
  geom_vline(data =damage_reshape[which(damage_reshape$damage_category == "DAMAGE_PROPERTY"),] ,aes(color = "median", xintercept = median(damage_reshape$amount[!is.na(damage_reshape$amount) &  damage_reshape$damage_category == "DAMAGE_PROPERTY"]+1), linetype = "median" ) )

p7 = p7 + geom_vline(data =damage_reshape[which(damage_reshape$damage_category == "DAMAGE_CROPS"),] ,aes(color = "mean", xintercept = mean(damage_reshape$amount[!is.na(damage_reshape$amount) &  damage_reshape$damage_category == "DAMAGE_CROPS"]), linetype = "mean" ) ) +
  geom_vline(data =damage_reshape[which(damage_reshape$damage_category == "DAMAGE_CROPS"),] ,aes(color = "median", xintercept = median(damage_reshape$amount[!is.na(damage_reshape$amount) &  damage_reshape$damage_category == "DAMAGE_PROPERTY"]+1), linetype = "median" ) )

  
p8 = ggplot(aes(x = damage_category, y=total/1000000000), data = total_damage_by_category) + geom_col(aes(fill = damage_category)) +
  ylab("Amount (Billion $)") +
  xlab("Damage Category")+
  theme_minimal() +
  theme(axis.text = element_text(size = 10), axis.title =element_text(size=12)) +
  scale_x_discrete(labels=c("CROPS","PROPERTY")) +
  scale_fill_manual(values=getPalette(2))+
  scale_y_sqrt(breaks = c(69, 1000, 2000, 3000, 3445)) +
  guides(fill=F)

grid.arrange(p7,p8, ncol=2, widths = c(22,6),  top=textGrob("Distribution of Damage Amount incured by Extreme Weather",  gp=gpar(fontsize=14, face = "bold")))

The distribution of costs incurred by extreme weather is highly right skewed for both damage categories (crops vs. property). For the vast majority of events no costs were incurred, which is also reflected by a median of 0 for both damage categories. Overall, there are more extreme weather events recorded that incurred higher costs on property as opposed to crops. This is reflected by a larger mean for costs on property damage. Interestingly, irrespective of damage category incurred costs seem to form 4-5 different clusters, which are: no costs, small costs (100-1000 USD), medium-higher costs (10000-1000000 USD) and high-extreme costs (double-digit millions-billions). The plot on the right side in figure 3 shows the total costs incurred over the period 1950-2017. Overall, with over 3000 billion USD the total costs for property damage are much higher than for damage on crops.

What are regions of “highest risk” for extreme weather?

Let’s have a look at the accumulated damage amount, the total number of casualties and the frequency of extreme weather events for each state over the period from 1950-2017. The distribution of casualty numbers is highly skewed with most values in the low range and a few very large values. For better visualization results, the casualty numbers will be binned into appropriate groups, representing different magnitudes of casualties (very low, low, medium, etc.). Alternatively, log10 transformation would also be suitable.

# transform states/county data to data frame with fortify(); fortify takes a model, which you want to convert, and region defines the variable by which to split the data
# fips codes essential for county-specific mapping, because county names aren't unique
states = fortify(usa_composite(),region="fips_state") 
cmap = fortify(counties_composite(), region="fips")

# combine property (DAMAGE_PROPERTY) and crops damage (DAMAGE_CROPS) to calculate the total damage for each event
# this conversion will be required in other reshaped variants of the main data frame df_events_victims. 
df_total_damage = df_events_victims %>%
  # converts NA to 0 for one damage category if other category is not NA; This avoids NA result when summing number and NA 
  mutate_at(vars(DAMAGE_PROPERTY), funs(ifelse(!is.na(DAMAGE_CROPS) & (is.na(DAMAGE_PROPERTY) | DAMAGE_PROPERTY==0),0, DAMAGE_PROPERTY ))) %>%
  mutate_at(vars(DAMAGE_CROPS), funs(ifelse(!is.na(DAMAGE_PROPERTY) & (is.na(DAMAGE_CROPS) | DAMAGE_CROPS==0),0, DAMAGE_CROPS ))) %>%
  transform(damage_combined = DAMAGE_PROPERTY+DAMAGE_CROPS) %>%
  # remove NA to avoid NA result when summing number and NA
  filter(!is.na(damage_combined))

# returns total damage amount for each state
total_damage_by_state = subset(df_total_damage, select = c("STATE","STATE_FIPS","damage_combined")) %>%
  # state fips have two-digit format; formatC() preserves 2-digit format and adds a 0 if entry is of 1-digit type
  transform(FIPS = formatC(STATE_FIPS, digits=2,width=2,format = "d", flag=0)  ) %>%
  filter(FIPS %in% states$id) %>%
  group_by(STATE,FIPS) %>%
  summarise(total_damage = sum(damage_combined)) %>%
  ungroup()

# returns total casualty number for each state
total_casualty_by_state = subset(casualties_reshape, select=c("STATE", "FIPS", "Fatalities", "Injuries")) %>%
  transform(casualties_combined = Fatalities+Injuries) %>%
  filter(!is.na(FIPS)) %>%
  group_by(STATE,FIPS) %>%
  filter(FIPS %in% states$id) %>%
  summarise(total_casualties = sum(casualties_combined)) %>%
  ungroup()

# define discrete groups for casualty numbers
total_casualty_by_state["casualty_bins"] = cut(total_casualty_by_state$total_casualties, breaks=c(0,100,1000,10000,50000,100000,500000), dig.lab = 6)  

# return geographic coordinates for each extreme weather event; required to calculate the probability density function for events in US and to represent probabilities of extreme weather for different geographic locations (via stat_density2d() function)
event_coordinates = subset(df_events_victims, select=c("BEGIN_LAT", "BEGIN_LON" )) %>% 
  # filter for the 48 contiguous states (- Alaska and Hawaii) to avoid clipping artifacts of polygons in stat_density2d() function
  filter(BEGIN_LAT>=24.396308 & BEGIN_LAT<=49.384358) %>%
  filter(BEGIN_LON > -124.848974, BEGIN_LON < -66.885444)
p9 = ggplot()+
  geom_map(data=states,map=states,aes(x=long,y=lat, map_id =id), color ="black", size=0.4,fill=NA) +
  geom_map(data = total_casualty_by_state, map=states, aes(fill= casualty_bins, map_id=FIPS), na.value="white") +
  ggtitle("Total Casualties (1950-2017)") +
  # the log10 for the total damage min value is 8.094 
  scale_fill_viridis(name = "Number of Casualties", option = "A", discrete = TRUE, labels = c("< 100", "100-1000", "1000-10000", "10000-50000","50000-100000", "> 100000"),
                     guide =guide_legend(title.position = 'top', title.hjust = 0.5)) +
  theme_minimal(base_size = 12) +
  theme(axis.line = element_blank(), axis.text = element_blank(),
    axis.ticks = element_blank(), axis.title = element_blank(), plot.title= element_text(hjust=0.5, size = 14, face = "bold"),
    panel.grid.major = element_blank(), legend.position = "bottom", legend.text=element_text(size=12), legend.title=element_text(size=14) ) +
  coord_map()


p10 = ggplot()+
  geom_map(data=states,map=states,aes(x=long,y=lat, map_id =id), color ="black", size=0.4,fill=NA) +
  geom_map(data = total_damage_by_state, map=states, aes(fill=log10(total_damage), map_id=FIPS), na.value="white") +
  ggtitle("Total Damage (1950-2017)") +
  # the log10 for the total damage min value is 8.094 
  scale_fill_viridis(name = "Amount in $, log scale", option = "A", 
    limits=c( log10( 10**8 ), log10(max(total_damage_by_state$total_damage))),
    breaks=c(log10(10**8),log10(10**9),log10(10**10),log10(10**11),log10(10**12)),
    labels=c("$100 Million","$1 Billion", "$10 Billion","$100 Billion", "$1000 Billion"),
    guide = guide_colorbar(direction = "horizontal",barheight = unit(2, units = "mm"),barwidth = unit(50, units = "mm"), title.position = 'top', title.hjust = 0.5, label.hjust = 1.0, label.vjust = 1.0 )) +
  theme_minimal(base_size = 12) +
  theme(axis.line = element_blank(), axis.text = element_blank(),
    axis.ticks = element_blank(), axis.title = element_blank(), plot.title= element_text(hjust=0.5, size = 14, face = "bold"),
    panel.grid.major = element_blank(), legend.position = "bottom", legend.text=element_text(angle = 45, size=12), legend.title=element_text(size=14)) +
  coord_map()


p11 = ggplot()+
  geom_map(data=states,map=states,aes(x=long,y=lat, map_id =id), color ="black", size=0.4,fill=NA) +
  stat_density2d(aes(y=BEGIN_LAT, x=BEGIN_LON, fill = ..level.., alpha = ..level.. ), size = 0.01, bins = 20, data = event_coordinates, geom = 'polygon')+
  scale_alpha(guide = FALSE)+
  ggtitle("Extreme Weather Probability Estimate") +
  scale_fill_viridis(name = "Density Level",option = "A",
    guide = guide_colorbar(direction = "horizontal", barheight = unit(2, units = "mm"),barwidth = unit(50, units = "mm"), title.position = 'top', title.hjust = 0.5, label.hjust = 1.0, label.vjust = 1.0))+
  theme_minimal(base_size = 12) +
  theme(axis.line = element_blank(), axis.text = element_blank(),
    axis.ticks = element_blank(), axis.title = element_blank(), plot.title= element_text(hjust=0.5, size = 14, face = "bold"),
    panel.grid.major = element_blank(), legend.position = "bottom", legend.text=element_text(angle = 45,size=12), legend.title=element_text(size=14)) +
  coord_map()
  
plot_grid(p9,p10,p11, align = "hv", ncol = 3)

The density plot (figure on the right) depicts the density of events as a function of geographical coordinates, which basically represents the probability estimate for occurrences of combined weather events (based on data from 1950-2017) for a particular area. It shows highest probabilities for the North East, Mid Atlantic, South East, Mid West, Mid South and large parts of the Central region. If we zoom in, we can identify true hotspots (patches with density level >= 0.004), e.g., the most parts of Oklahoma and Kansas, parts of Texas or the region encompassing New Jersey, Maryland, Washington, DC and Delaware. So those areas seem to have the highest probability or risk of being confronted with extreme weather. As expected, event density correlates well with total damage amount and total casualties. Those states or regions with highest probability tend to have higher monetary damage and higher casualty numbers caused by extreme weather. However, we also observe some irregularities with respect to this correlation. E.g., Louisiana, California and Florida exhibit very high damage costs but proportionally lower event densities. Similarly, casualty numbers and event density seem to be disproportional for Nevada. This rather implicates that those states experienced “fewer” but much more severe events as opposed to other states, albeit high damage in states with high event frequency might also be attributable to few severe events. An overall higher extreme weather probability does not necessary implicate higher probability for most severe events!

Temporal change in occurrence of extreme weather events

We know that the geographical distribution of extreme weather seems to be highly biased towards the eastern part of the United States if we take the accumulated events over the past 70 years. Equally interesting, however, is the change in event frequency over time. Do we observe any obvious increase or decrease in frequency for different event categories?

# load US cities data and filter for cities with 500.000 population or greater
data(us.cities)
us_cities = us.cities 
us_cities = us_cities %>%
  filter(pop >= 500000)
# function to create GIFs that show the number of events over time (for 5 year periods)
myplot_GIF = function(period, event_filter="All Combined", guide=T) {
  # period (string): decades_buckets values (specifies the period for which the number of events should be plotted)
  # event_filter (string): specifies which storm event category to use from the df_events_victims dataset (e.g. Tornado); no specification (all categories combined) is the default 
  # guide (bool): specifies whether legend should be included in the plot 
  storm_dataset = subset(df_events_victims, select=c("STATE","STATE_FIPS", "CZ_FIPS", "decades_buckets", "EVENT_CATEGORY")) %>%
  mutate_at(vars(STATE), funs(tolower)) %>%
  transform(FIPS = paste(formatC(STATE_FIPS, digits=2,width=2,format = "d", flag=0), formatC(CZ_FIPS, digits=3,width=3,format = "d", flag=0), sep="")  ) %>%
  transform(FIPS=as.character(FIPS)) %>%
  select(c("STATE", "FIPS", "decades_buckets", "EVENT_CATEGORY")) 
  
  # calculate the 90th percentile of event count distribution (either for a specified category or for combined events) to define the upper limit for number of events in the plots
  if(event_filter != "All Combined"){
    upper_limit = storm_dataset %>%
      group_by(FIPS,decades_buckets, EVENT_CATEGORY) %>%
      filter(EVENT_CATEGORY == event_filter ) %>%
      summarise(counts = n()) %>%
      ungroup()
  }else{
    upper_limit = storm_dataset %>%
      group_by(FIPS,decades_buckets) %>%
      summarise(counts = n()) %>%
      ungroup()  
  }
  upper_limit = quantile(upper_limit$counts,probs = 0.9)
  
  # determine counts for combined events, or group by specified category
  if(event_filter != "All Combined") {
    storm_dataset = storm_dataset %>% filter(EVENT_CATEGORY == event_filter) %>%
    group_by(FIPS,decades_buckets, EVENT_CATEGORY)
  }else{
  storm_dataset = storm_dataset %>%
  group_by(FIPS,decades_buckets)
  }
  storm_dataset = storm_dataset %>%
  filter(decades_buckets == period) %>%  
  summarise(events = n())
  storm_dataset = storm_dataset %>%
  transform(events = ifelse(events > upper_limit, upper_limit, events))  %>%
  ungroup() %>%
  ungroup() 
  
  # plot reshaped data 
  ggplot()+
  # use regex to transfor period from "(year,year]" format (standard cut() output) into "year-year" format (more appealing)
  ggtitle(gsub('^(\\()([0-9]{4}),([0-9]{4})(\\])$', '\\2-\\3', period)) +
  geom_map(data=cmap, map=cmap, aes(x=long, y=lat, map_id =id), fill="white", color="black", size = 0.05) +
  geom_map(data = storm_dataset, map=cmap, aes(fill=events, map_id=FIPS), na.value="white") +
  geom_map(data=states,map=states,aes(x=long,y=lat, map_id =id), color ="black", size=0.8,fill=NA) +
  geom_point(aes(x=long, y=lat,color = ""), data=us_cities, size=2) +  
  coord_map() +
  theme_minimal(base_size = 12) +
  theme(axis.line = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank(), axis.title = element_blank(), plot.title= element_text(hjust=0.5, size = 16, face = "bold"),
        panel.grid.major = element_blank(), legend.position = "bottom", legend.text=element_text(size=14), legend.title=element_text(size=14)) +
  scale_color_manual(values = "red" , guide = guide_legend(title = "Over 0.5 Mio. Population") ) +  
  scale_fill_viridis(name = "Number of Events", option="D",na.value="white", limits=c(1,upper_limit), 
      guide = guide_colorbar(direction = "horizontal",barheight = unit(2, units = "mm"),barwidth = unit(50, units = "mm"), title.position = 'top', title.hjust = 0.5 )) +
    
  if(guide == F){
    guides(fill=F, color = F) 
  }
}

# convert legend into plotting objects in order to individualize multiplots
get_legend = function(plot){
    # build all plot grobs
    grob_table = ggplot_gtable(ggplot_build(plot))
    # identify list in grob_table that stores information about the legend
    legend_list = which(sapply(grob_table$grobs, function(x) x$name) == "guide-box")
    # extract legend object (contains all information about the plot)
    legend = grob_table$grobs[[legend_list]]
    return(legend)
}

make_GIF = function(event){
  # get legend plot
  legend = get_legend(myplot_GIF("(2010,2015]", event))
  # adjust legend x and y axis orientation on plotting canvas
  legend$vp$x = unit(1.5, 'npc')
  legend$vp$y = unit(1.5, 'npc')  
  
  oopt = ani.options(interval = 1)
  saveGIF(
    {for (period in unique(df_events_victims$decades_buckets[!is.na(df_events_victims$decades_buckets)]) ) {
    time_series = myplot_GIF(period, event,F)
    first_record = myplot_GIF("(1949,1955]",event,F)
    last_record = myplot_GIF("(2010,2015]",event,F)
  
    print(grid.arrange(first_record,last_record,time_series,legend, heights=c(10,1),widths =c(1,1,1),ncol=3, top=textGrob(paste("Number of Extreme Weather Reports for the Category",event),  gp=gpar(fontsize=24))))
    ani.pause()
    }},movie.name = paste("Extreme_Weather_timeseries_", event, ".gif", sep=""),ani.width = 1500, ani.height = 500)
}
# make GIF files for different Weather categories
for(event in c("Tornado", "Hail", "Storm", "All Combined")){
  make_GIF(event)
}
knitr::include_graphics('Extreme_Weather_timeseries_Tornado.gif')

knitr::include_graphics('Extreme_Weather_timeseries_Storm.gif')

knitr::include_graphics('Extreme_Weather_timeseries_Hail.gif')

knitr::include_graphics('Extreme_Weather_timeseries_All combined.gif')

I have combined events into 5 year bins in order to better detect any differences in event frequency changes. In this context, the data was restricted to the 90th percentile for each category. Again, this allows a better representation of data, as outliers with extremely high event count would “inflate” the scale and, thus, reduce the comparability between the time periods. The years 2016 and 2017 have been removed for consistency. Furthermore, the analysis was conducted for the event categories Tornado/Hail/Storm and all categories combined, respectively. Overall the time series analysis shows a continuous increase in total extreme weather occurrences. This is also true for each category. The plots/GIF representing Tornado activity clearly depicts the “Tornado Alley”, which is the area where Tornadoes are most frequent, encompassing large parts of Central United States (https://en.wikipedia.org/wiki/Tornado_Alley). Interestingly, the frequency of Hail events in later periods (After 1995) strongly resembles the “Tornado Alley”, however, shifted more towards the north. The frequency of Storms seems to have increased especially for coastal regions in the Southwest, North- and Southeast and the area around the great lakes. Additionally I have included US cities with 500.000 or more population. The aim was to analyse whether the risk (more occurrences) for extreme weather events increased over time for those major hubs that harbour most of the total US population. The data clearly shows that irrespective of weather category, the major cities fall in areas of highest event frequency!!

Correlation between temperature, carbon dioxide emission and extreme weather

First of all, let’s analyse the 20th century Temperature profile on different scales - Global, USA, USA States -. Full coverage of US states temperature measurements is not present in the temperature dataset used in this study until the late 19th century. For the purpose of consistency, the analysis will focus on the 20th century time period. Additionally, the time-dependent Carbon dioxide emission and atmospheric Carbon dioxide concentration will be analysed.

# Load temperature data and transform State var to uppercase type, define columns Year and Month to store year and month of measurement. Columns State, Year and Month will be used to reference  corresponding Temperature to Year and corresponding State Storm Events (here, states in all uppercase!!)
temp_data = read.csv("GlobalLandTemperaturesByState.csv")
colnames(temp_data)[2] = "MonthAvgTemp_STATE"
temp_data = temp_data %>%  
              mutate(Year = as.Date(dt), Year = format(Year,"%Y"), Year = as.integer(Year)) %>% # Year will be used to refer Temperature to df with Storm Events
              mutate(Month = as.Date(dt), Month = format(Month,"%m"), Month = as.integer(Month)) %>%
              mutate_at(vars(State), funs(toupper))

# calculate global temperature average for 20th century
global_temp = subset(temp_data,!is.na(temp_data$MonthAvgTemp_STATE)) %>% 
                group_by(Year) %>%
                filter(Year >= 1900) %>% # here fiter year
                summarise(global_avg = mean( MonthAvgTemp_STATE))
                

# group temperature by Year and State and calculate the annual average for each state. 
usa_temp = subset(temp_data,!is.na(temp_data$MonthAvgTemp_STATE)) %>%  
              group_by(Year,State, Country) %>% # group by state and year to calculate average temp for each state and year
              filter(Country=="United States", Year >= 1900) %>% # here fiter year
              summarise(YearAvgTemp_STATE=mean(MonthAvgTemp_STATE)) %>%
              ungroup()

# standarize state naming (state naming checked before using table() and unique() functions)
usa_temp = usa_temp %>%
  mutate_at(vars(State), funs(ifelse(State == "GEORGIA (STATE)", "GEORGIA", State)))

# calculate the grand temperature average for 19th and/or 20th century for each state, which will be used as reference for temperature changes over time.
# add state FIPS code (from total_casualty_by_state) for mapping with geom_map
century_average_by_state = usa_temp %>%
  group_by(State)%>%
  summarise(state_century_average = mean(YearAvgTemp_STATE))

# add state century average temperature to global_temp dataset
usa_temp = left_join(usa_temp,century_average_by_state,by = "State")

# group entries by 1/2 decades
usa_temp["decades"] = cut(usa_temp$Year, breaks=c(seq(1899,2005,5),2013), dig.lab=4 )

# subtract each annual state average temperature from the state century average temperature
# use only 20th century data (not for all years/periods temperature data for all states)
usa_temp = usa_temp %>%
  filter(Year >= 1900) %>% # here filter
  transform(temp_deviation_from_century = YearAvgTemp_STATE - state_century_average)

# group by states and time period and calculate the decades average deviation in temperature from the century average temperature
usa_states_temp_by_decades = usa_temp %>%
  group_by(State, decades) %>%
  summarise(avg_temp_deviation = mean(temp_deviation_from_century)) %>%
  ungroup()

# add state FIPS code (from total_casualty_by_state) for mapping with geom_map 
usa_states_temp_by_decades = left_join(usa_states_temp_by_decades, total_casualty_by_state[,1:2], by = c("State" = "STATE"))
 
# cut avg_temp_deviation into bins
usa_states_temp_by_decades["temp_bins"] = cut(usa_states_temp_by_decades$avg_temp_deviation, breaks=c(-1.2, -0.8, -0.6, -0.4, -0.2, 0, 0.4, 0.8, 1.2, 1.6, 2))

# calculate the annual average temperature for the united states (will be used in combination with annual CO2 conc., annual weather count and annual damage)
annual_temp_usa = usa_temp %>%
  group_by(Year) %>%
  summarise(annual_temp = mean(YearAvgTemp_STATE)) 
  
# Load data on atmospheric carbon dioxide concentration, group by Year and calculate the average
global_carbon = read.csv("carbon_dioxide_levels.csv")
global_carbon = subset(global_carbon, !is.na(global_carbon$Carbon.Dioxide..ppm.)) %>% 
                group_by(Year) %>%
                summarise(YearAvgCarbon = mean(Carbon.Dioxide..ppm.))

# skip first 27 lines (contain infromation about the data); data from http://cdiac.ornl.gov/CO2_Emission/timeseries/global
# load the data, filter by year and convert emission values from million tonnes per year to billion tonnes per year (GtC/yr)
global_co2_emission = read.csv("Global_emission_8192.csv", skip = 27)
global_co2_emission = global_co2_emission %>%
  filter(Year >= 1900) %>%
  transform("percentage_increase" = (Total/Total[1])*100-100 )

# calculate the highest US temperature records since 1900
usa_high_temp = vector()
highest_temp_usa = annual_temp_usa$annual_temp[1]
for(i in 1:nrow(annual_temp_usa)){
  if(annual_temp_usa$annual_temp[i] >highest_temp_usa) {
    highest_temp_usa = annual_temp_usa$annual_temp[i]
    usa_high_temp = c(usa_high_temp, 1)
  }else{
    usa_high_temp = c(usa_high_temp, 0)
  }
}
annual_temp_usa = cbind(annual_temp_usa, usa_high_temp)

# calculate the highest global temperature records since 1900
global_high_temp = vector()
highest_temp_global = global_temp$global_avg[1]
for(i in 1:nrow(global_temp)){
  if(global_temp$global_avg[i] > highest_temp_global) {
    highest_temp_global = global_temp$global_avg[i]
    global_high_temp = c(global_high_temp, 1)
  }else{
    global_high_temp = c(global_high_temp, 0)
  }
}
global_temp = cbind(global_temp, global_high_temp)

# Use a Red-Blue color palette to represent low and high temperature for a total of 10 discrete groups
getPalette = colorRampPalette(brewer.pal(10,"RdYlBu"))
# function used to plot temperature deviations from 20th century average for different time periods and for each state 
temp_change_GIF = function(period){
  # period (string): decades_buckets values (specifies the period for which the number of events should be plotted)
  ggplot() + 
  geom_map(aes(x=long, y=lat, map_id = id), data=states, map = states,color ="black", size=0.8, fill=NA)+
  geom_map(aes(fill = temp_bins, map_id = FIPS ), data = subset(usa_states_temp_by_decades, usa_states_temp_by_decades$decades == period), map =states)+
  coord_map() +
  theme_minimal() +
  ggtitle(paste("Deviation from US 20th century Temperature Average\n", gsub('^(\\()([0-9]{4}),([0-9]{4})(\\])$', '\\2-\\3', period))) +
  theme(axis.line = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank(), axis.title = element_blank(), plot.title= element_text(hjust=0.5, size = 16, face= "bold"),
        panel.grid.major = element_blank(), legend.position = "bottom", legend.text=element_text(size=12), legend.title=element_text(size=14)) +
  # reverse color palette (otherwise high = blue and low = red)
  scale_fill_manual(values = rev(getPalette(10)), limits = levels(usa_states_temp_by_decades$temp_bins), breaks = levels(usa_states_temp_by_decades$temp_bins), 
        labels=c("-1.2 to -0.8", "-0.8 to -0.6", "-0.6 to -0.4", "-0.4 to -0.2", "-0.2 to 0", "0 to 0.4", "0.4 to 0.8", "0.8 to 1.2", "1.2 to 1.6", "1.6 to 2"),
        name = "Temperature (C°)",
        guide =guide_legend(title.position = 'top', title.hjust = 0.5))
  }  


saveGIF(
    {for (period in unique(usa_states_temp_by_decades$decades[!is.na(usa_states_temp_by_decades$decades)]) ) {
    time_series = temp_change_GIF(period)
    
    temp_profile = ggplot(aes(x=Year, y = annual_temp ), data = annual_temp_usa) + geom_line() + geom_point(aes(color = "USA Temperature")) + geom_smooth(color = "darkblue") + 
                    geom_line(aes(x = Year, y = global_avg), data = global_temp) + 
                    geom_point(aes(x = Year, y = global_avg, color = "Global Temperature"), data = global_temp) + 
                    geom_smooth(aes(x = Year, y = global_avg, color = "Global Temperature"), data = global_temp) +
                    scale_x_continuous(breaks=seq(1900,2010,10))+
                    ggtitle("Average Global and USA Temperature Profile")+
                    ylab("Temperature (C°)") +
                    scale_y_continuous(limits = c(7,14)) +
                    theme(axis.text = element_text(size = 12), plot.title= element_text(hjust=0.5, size = 16, face = "bold"), 
                          legend.position = c(0.02, 0.9), legend.box = 'horizontal', legend.text=element_text(size=12)) +
                    scale_color_manual(values = c("USA Temperature"= "darkblue", "Global Temperature" = "darkred"), name = "") +
                    geom_hline( aes(yintercept = mean(annual_temp_usa$annual_temp), linetype = "USA 20th century Temperature Average") ) +
                    geom_hline( aes(yintercept = mean(global_temp$global_avg), linetype = "Global 20th century Temperature Average")) +
                    scale_linetype_manual(name = "", values = c("Global 20th century Temperature Average" = 1, "USA 20th century Temperature Average" = 3)) +
                    geom_segment(aes(x = Year, xend = Year, y = -Inf, yend = usa_high_temp*7.3 ), data = annual_temp_usa, size = 1.5, color = "darkblue") +
                    geom_segment(aes(x = Year, xend = Year, y = 7.3, yend = global_high_temp*8.0 ), data = global_temp, size = 1.5, color = "darkred")
    
    carbon_profile = ggplot(aes(x = Year, y =  YearAvgCarbon, color = "CO2 Concentration"), data = global_carbon)+
                      geom_line()+
                      geom_point()+
                      ylab("Carbon Dioxide (ppm)")+
                      ggtitle("Global Carbon Dioxide Concentration and\n Emission Increase (since 1900)")+
                      # in order to have two y axis with different scales in one plot, rescaling of one data set is required
                      # 1. scale percentage_increase (increase in CO2 emission in percent) by multiplying with 410/2000; a scale percentage_increase value of 2000 will be converted to 410
                      geom_line(aes(x = Year, y = percentage_increase * 410/2000),color="black", data = global_co2_emission)+
                      geom_point(aes(x = Year, y = percentage_increase * 410/2000, color = "CO2 Emission"), data = global_co2_emission) + 
                      geom_smooth(aes(x = Year, y = percentage_increase * 410/2000), data = global_co2_emission, color = "darkblue")+
                      # 2. include secondary y axis and rescale axis by multiplying with 2000/410; a tick at 410 on the secondary y axis will be converted to 2000
                      # after data and y-axis rescaling, correct values will match correct ticks on secondary y-axis
                      scale_y_continuous(sec.axis = sec_axis(~.*2000/410, name = "Increase in CO2 emission (%)" ) ) +
                      scale_x_continuous(breaks=seq(1900,2010,10)) +
                      theme(axis.text = element_text(size = 12), plot.title= element_text(hjust=0.5, size = 16, face = "bold"), 
                          legend.position = c(0.02, 0.9), legend.box = 'horizontal', legend.text=element_text(size=12)) +
                      scale_color_manual(values = c("CO2 Emission" = "darkblue", "CO2 Concentration"= "darkred"), name = "")
    
    grid.arrange(time_series, temp_profile,carbon_profile, ncol = 3, heights =c(0.1,0.01)) 
    ani.pause()
    }},movie.name = "Temperature_change.gif",ani.width = 1500, ani.height = 500)
knitr::include_graphics('Temperature_change.gif')

A GIF format provides a convenient means of presenting the seasonal temperature profile for all US states (instead of 51 scatter plots). The 20th century USA temperature average was used as the reference to depict temperature changes as the function of time. As we can already observe from the scatter plot in the middle panel, there is an inherent fluctuation in temperature over time, with higher temperatures followed by lower temperatures. To smooth those “short term” fluctuations in order to display the “long-term” trend for US states, each state’s moving 5-year temperature averages were calculated, starting at 1900, and assessed against the 20th century average (decrease vs. increase). Again, time-dependent fluctuations are still existent with 5-year averages. However, the trend is still very obvious. Overall average temperatures increase over time and the aforementioned fluctuations take place at different temperature range, which is constantly shifting towards higher absolute temperatures, i.e. for a high temperature/low temperature trend in later time periods, the lower temperature is higher in absolute value than the corresponding lower temperature in similar trends but in earlier time periods -and often even higher than the higher temperature in earlier periods- If we zoom out and have a look at the time-dependent temperature trend for the entire United States, or the global trend, we can observe an temperature increase from 1900 to 1940. From 1940 to 1970, the temperature shows a small decrease for the United States and no change on global scale. In summary, from 1900 to approximately 1970 the average temperature (fitted temperature) is below or close to the cognate 20th century temperature average. However, from 1970 onwards, there is a second period of even stronger temperature increase. This two step profile (increase-stagnation-increase) is also true for the temperature profile of the US states. The vertical blue and red bars (figure in middle panel) represent highest temperature records since 1900, which were determined as follows: Starting with the average temperature in 1900, the next highest temperature was recorded and set as highest temperature. This process was continued until the last temperature record in 2013. The data shows that the frequency of highest temperature measurements increases with the onset of the second warming phase (from 1970). Unfortunately the dataset is missing the measurements for the years 2014 to 2016/2017. However, based on data from NOAA (https://www.ncdc.noaa.gov/sotc/global/201613), 2014, 2015 and 2016 set new records on warmest years!! As such, the frequency for highest temperature records is even higher for the late 20th century as opposed to early 20th century. Temperature data for 2014, 2015 and 2016 was explicitly not included from other datasets, because it is not clear how temperature data was collected in different surveys, which makes them less comparable!! Moreover, only global temperature data was found for the period between 2014 and 2016. The figure in the right panel shows the global increase (1900 as reference) in annual Carbon dioxide emission for the 20th century and the atmospheric Carbon dioxide concentration since 1955. The emission profile exhibits a “slow” increase rate from 1900 to 1945, reaching approximately 200% higher emission values than in 1900. From 1945 global CO2 emissions began to skyrocket with an much higher rate in increase for the period until 2013. Moreover, the data shows a record emission increase for the year 2013 with almost 1800% more total emission compared to the reference value in 1900!! This is consistent with a constant increase in atmospheric CO2 concentration, albeit at a slower rate. This is due to the ocean’s ability to buffer a great amount of CO2, however, at the expense of gradual acidification. This, in turn, has a major impact on coral bleaching (together with global warming), which is another story but basically equivalent to extreme weather in terms of having a great negative impact on global ecology and thus, global economy and population!! In summary, Temperature data clearly supports on going global warming. If we project the temperature trend to include Years 2014-2016 (each consecutive year with a higher temperature) the rate of second warming period in the 20th century (from 1970) is even higher as depicted in the figure and overall considerably higher than for the first warming period (1900-1950). If we now combine data on Carbon dioxide emission and concentration, it suggests that this is very likely triggered by humans due to massive burning of fossil fuels, in particular after 1950.

Combining the results from the time series analysis on the number of extreme weather events with the temperature profile in the course of time, we already see the trend that global warming and increasing occurrence of extreme weather are correlated. Nevertheless, lets put the corresponding data into a quantitative context and calculate the respective correlation coefficients.

I assume that direct plotting of Temperature against Weather event counts will produce a biased correlation between the pairs of cognate variables. Based on the temperature time series analysis at US state-scale, we know that average State temperatures are increasing over time. Moreover, there is quite some variability in the 20th century average for different regions or States of the United States (E.g., high and low average temperatures for Florida and North Dakota). So by ignoring region specificity (State) and the timeframe (Year), we would encounter similar temperature ranges from different states but with the crucial difference that for some states a particular temperature range might be normal in terms of context to the 20th century average, whereas for other states the very same temperature range might represent increased temperatures -later period in 20th century- due to global warming. Since occurrence of a particular weather event depends on more geographical or climatological factors than merely temperature and we can expect a certain “normal baseline” in terms of extreme weather frequency for each state but concomitantly a state- or geographical location-specific baseline. If we don’t account for this variation, the analysis would ultimately include unnecessary noise. Basically, if we don’t group by State and Year (State scale) or Year (Country scale), we would have much more data points for plotting and calculation, however, with reduced power to discriminate “normal” from “increased” or “decreased” values of damage amounts/number of weather events as a function of temperature.

Consequently, because we want to test the hypothesis whether changes in temperature correlate with changes in event frequency (I don’t want to check for correlation between absolute temperatures and damage, event count) it is reasonable to either use annual US temperature averages with annual event frequency (events per year), or facilitate the analysis separately for each state (using corresponding values) followed by calculating the average.

# for each state, calculate annual number of extreme weather events
event_count_by_state_year = df_events_victims %>%
  group_by(STATE,YEAR) %>%
  summarise(event_count = n()) %>%
  ungroup

# for each state, combine data on temperature, damage and extreme weather count
# This data will be used for calculation of Pearson Correlation Coefficient between temperature vs. weather counts and temperature vs. damage for each state of the United States 
event_temp_by_state_year = subset(usa_temp, select = c("Year", "State", "YearAvgTemp_STATE")) %>%
  left_join(event_count_by_state_year, by = c("Year" = "YEAR", "State" = "STATE")) %>%
  filter(complete.cases(.))

In general, analysis at state-scale is the most accurate method if we have good data coverage for each state, i.e. temperature data and damage-/ weather counts for all states encompassing the entire period -or most part of it- between 1950 (first year in storm events dataset)-2013 (last year in temperature dataset). As already seen in the figure “Average Global and USA Temperature Profile”, the temperature data is very noisy at short timeframe scale and this fact can lead to wrong results and interpretations if the data range used for the analysis covers only a small part of the total time frame. E.g., the period from 1945 to 1970 would actually be indicative of global cooling instead of global warming (see figure “Average Global and USA Temperature profile”)!! The code below, returns for each state the total number and range (start-end Year) of data points, respectively. With the exception of Alaska, Hawaii, District of Columbia and Rhode Island, nearly full coverage over the entire period from 1950-2013 is given for all states.

# return the number of years for which data is present as well as the min and max year
for(state in unique(event_temp_by_state_year$State)){
  state_data = subset(event_temp_by_state_year, event_temp_by_state_year$State == state)
  print( paste(state, nrow(state_data), min(state_data$Year), max(state_data$Year)) )
}
## [1] "ALABAMA 64 1950 2013"
## [1] "ARKANSAS 64 1950 2013"
## [1] "COLORADO 64 1950 2013"
## [1] "CONNECTICUT 61 1950 2013"
## [1] "FLORIDA 64 1950 2013"
## [1] "GEORGIA 64 1950 2013"
## [1] "ILLINOIS 64 1950 2013"
## [1] "INDIANA 64 1950 2013"
## [1] "IOWA 64 1950 2013"
## [1] "KANSAS 64 1950 2013"
## [1] "KENTUCKY 63 1950 2013"
## [1] "LOUISIANA 64 1950 2013"
## [1] "MARYLAND 62 1950 2013"
## [1] "MINNESOTA 64 1950 2013"
## [1] "MISSISSIPPI 64 1950 2013"
## [1] "MISSOURI 64 1950 2013"
## [1] "NEBRASKA 64 1950 2013"
## [1] "NEW MEXICO 62 1950 2013"
## [1] "NORTH CAROLINA 64 1950 2013"
## [1] "NORTH DAKOTA 64 1950 2013"
## [1] "OHIO 64 1950 2013"
## [1] "OKLAHOMA 64 1950 2013"
## [1] "PENNSYLVANIA 64 1950 2013"
## [1] "SOUTH CAROLINA 64 1950 2013"
## [1] "SOUTH DAKOTA 64 1950 2013"
## [1] "TENNESSEE 64 1950 2013"
## [1] "TEXAS 64 1950 2013"
## [1] "WEST VIRGINIA 62 1950 2013"
## [1] "WISCONSIN 64 1950 2013"
## [1] "WYOMING 63 1950 2013"
## [1] "CALIFORNIA 61 1951 2013"
## [1] "MASSACHUSETTS 62 1951 2013"
## [1] "MICHIGAN 62 1951 2013"
## [1] "NEW HAMPSHIRE 62 1951 2013"
## [1] "NEW JERSEY 60 1951 2013"
## [1] "OREGON 58 1951 2013"
## [1] "VIRGINIA 63 1951 2013"
## [1] "ARIZONA 61 1952 2013"
## [1] "MONTANA 62 1952 2013"
## [1] "NEW YORK 61 1952 2013"
## [1] "MAINE 61 1953 2013"
## [1] "UTAH 60 1953 2013"
## [1] "VERMONT 60 1953 2013"
## [1] "DELAWARE 58 1954 2013"
## [1] "IDAHO 59 1954 2013"
## [1] "WASHINGTON 57 1954 2013"
## [1] "HAWAII 35 1955 2013"
## [1] "RHODE ISLAND 44 1956 2013"
## [1] "NEVADA 56 1957 2013"
## [1] "DISTRICT OF COLUMBIA 35 1975 2013"
## [1] "ALASKA 19 1993 2013"

I will calculate correlation coefficients for each State and compare with the average for the United States, which is simply inferred by taking the States’ correlation coefficients’ mean. Can we identify states or regions that are more or less prone to the consequences of global warming?

# calculate for each state the correlation coefficient for temperature and number of extreme weather events and temperature and damage, respectively
calculation_cor_coef = function(){
  cor_coef_all_states = data.frame("State" = NULL, "Cor_coef_damage" = NULL, "Cor_coef_events" = NULL)
  for(state in unique(event_temp_by_state_year$State)){
      cor_coef_events = with(subset(event_temp_by_state_year, event_temp_by_state_year$State == state), cor.test(YearAvgTemp_STATE, event_count))
      cor_coef_state = data.frame("State" = state, "Cor_coef_events" = cor_coef_events$estimate, row.names = NULL)
      cor_coef_all_states = bind_rows(cor_coef_all_states, cor_coef_state )
  }
  return(cor_coef_all_states)
}

# returns correlation coefficient by state
cor_coeff_states = calculation_cor_coef()

# define the top20 states with respect to strongest correlation between temperature and number of events
cor_coeff_events_top20 = cor_coeff_states %>%
  arrange(desc(Cor_coef_events)) %>%
  # divide states into one group harbouring the top20 states and the rest; defining the top20 as 1 and all other states as 0 allows to specifically color top20 states with ggplot's geom_map()
  transform(top20 = c(rep(1,20),rep(0,nrow(cor_coeff_states)-20)) )%>%
  # include state fips codes for correct plotting of states with geom_map()
  left_join(total_casualty_by_state[,1:2], by = c("State" = "STATE"))
p12=ggplot() + 
  geom_map(aes(fill = as.factor(top20), map_id = FIPS ), data = cor_coeff_events_top20, map =states)+
  geom_map(aes(x=long, y=lat, map_id = id), data=states, map = states,color ="black", size=0.5, fill=NA)+
  coord_map() +
  theme_minimal(base_size = 12) +
  ggtitle("Top20 Correlations by State") +
  theme(axis.line = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank(), axis.title = element_blank(), plot.title= element_text(hjust=0.5, size = 14, face= "bold"),
        panel.grid.major = element_blank(), legend.text=element_blank(), legend.title=element_text(size=12)) +
  scale_fill_manual(values = c("white", "darkred"), name = "", guide = F)

p13=ggplot(aes(x = State, y = Cor_coef_events), data = cor_coeff_states) + 
  
  geom_hline( aes(yintercept = mean(cor_coeff_states$Cor_coef_events), linetype = "temperature-sqrt(events)"))+
  geom_point(aes(color = "temperature-sqrt(events)"),shape = 15,  cex = 2)+
  ggtitle("Correlation coefficients by State") +
  ylab("Pearson's Correlation Coefficient")+
  # order the states according to decreasing temperature-events correlation coefficient 
  scale_x_discrete(limits = cor_coeff_states[order(cor_coeff_states$Cor_coef_events, decreasing = T),]$State) +
  scale_y_continuous(limits = c(-1,1), breaks = seq(-1,1,0.2))+
  theme(plot.title= element_text(hjust=0.5, size = 14, face= "bold"), axis.text.x = element_text(angle = 45, hjust = 1), axis.text = element_text(size = 10), 
        legend.title = element_text(size = 12, face= "bold"), legend.position = c(0.02, 0.2), legend.text=element_text(size=10), panel.grid.major = element_line(colour="grey95", size=0.5),
        legend.background = element_rect(fill="white"), legend.box = 'horizontal')+
  scale_color_manual(values=c("temperature-sqrt(events)"="darkred"), name = "Correlation at STATE scale")+
  scale_linetype_manual(values=c("temperature-sqrt(events)" = 2), name = "Correlation at USA scale")

grid.arrange(p13,p12,ncol=2,widths=c(10,5),top=textGrob("Correlation between Temperature and Number of Extreme Weather", gp=gpar(fontsize=16, fontface="bold")))

At Country scale -referred to the USA average-, there is a moderate correlation between Temperature and the number of extreme weather occurrences (Pearson’s correlation coefficient: 0.47). At State scale, however, we can see stronger correlation coefficients (~0.6-0.7) for some States, such as New Mexico, Arizona, New Jersey, etc. Intriguingly, at State scale the top 20 ranked States in terms of highest Correlation Coefficient (Temperature and Extreme Weather) are either coastal states or in vicinity to coastal states close to the Pacific, Gulf of Mexico and Atlantic. This is intriguing, because recent studies reported increasing hail and tornado frequency with increasing temperature in the Gulf of Mexico (http://onlinelibrary.wiley.com/doi/10.1002/2016GL071603/full). Atlantic ocean’s surface temperatures are known for periodic transition between warm and cold periods, a phenomenon called the Atlantic Multidecadal Oscillation (https://www.nature.com/articles/nature14491). Similarly to the inherent influence of the Gulf of Mexico, warm Atlantic periods are associated with higher frequency of extreme weather (http://www.iflscience.com/environment/atlantic-entering-cool-phase-will-change-world-s-weather/). The period from 1960 to 2016 experienced a transition from cold to warm (http://www.iflscience.com/environment/atlantic-entering-cool-phase-will-change-world-s-weather/, http://stateoftheocean.osmc.noaa.gov/atm/amo.php). As such, there is likely an additive effect from increasing surface and sea temperatures, which might explain the highest correlations between temperature and extreme weather frequency for locations with spatial proximity to sea.

Can we analyse the data at reduced reporting bias for severe weather events

As already mentioned, I assume that the number of recorded extreme weather events does not reflect the true number of events occurred each Year. In fact, I expect more recorded events in the later phase of the 1950-2017 period due to the advent of new technology, which improved detection and even more, very likely significantly reduced the effort it takes to report an event and the corresponding consequences (damage, injuries, etc.). However, I would also expect less bias in reporting for major extreme weather events, because they have greater impact on more people and as such are more likely to be detected and recorded. Based on this, I assume that major events (defined as: at least 50000000 USD damage cost incurred) are a good proxy whether occurrence of extreme weather events changed over the period of 1950-2017.

# returns the annual (1950-2017) total number of events as well as the annual number of "high total damage" events (defined as events with damage >= $50 Mio. USD) 
# df_total_damage is the version of the main data frame df_events_victims, however, with DAMAGE_PROPERTY and DAMAGE_CROPS combined to total_damage  
timeseries_events = subset(df_total_damage, select = c("decades_buckets", "YEAR", "EVENT_CATEGORY", "damage_combined")) %>%
  group_by(YEAR) %>%
  summarise(counts = n(), count_extreme = sum(damage_combined>=50000000))
p14=ggplot(aes(x = YEAR, y = counts, group=1), data = timeseries_events) + geom_line()  +
  scale_y_continuous(lim=c(0,100000), breaks=seq(0,100000,20000))+
  scale_x_continuous(breaks=seq(1950,2018,4))+
  ylab("Number of Events")+
  xlab("Year")+
  geom_point(aes(size = count_extreme), alpha = 1/2, shape =21, fill = I('#F79420')) +
  scale_size( "Number of Events with\nat least $50 Mio. Damage", breaks=c(0,25,50,75,100,200,500,1000,1500),labels=c(0,25,50,75,100,200,500,1000,1500), range=c(0,15)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size =10), axis.text.y = element_text(size =10), axis.title =element_text(size=12)) +
  annotate(size = 4, hjust =0, "text", x = 1950, y = 90000, label = paste( "Correlation Coefficient (Year - log10($50 Mio. Damage Events)): ", round((with(timeseries_events, cor.test(YEAR, log10(count_extreme+1))))$estimate,2) ))


grid.arrange(p14,  top=textGrob("Time series of the number of total extreme weather events\nand high-damage ($50 Mio. and more) events",  gp=gpar(fontsize=14, fontface="bold")))

We can see that the number of extreme weather events increased over time and exhibits a linear trend from 1950 to approximately 2005. From 2005 to 2016, there is an exponential jump in the number of events. The huge drop in event counts from 2016 to 2017 is not representative and can be explained by incomplete months for the year 2017. The last record in the data set for 2017 is dated to April (see below), so there are no measurements reported yet for 8 month as opposed to 2016. Strikingly, between 1950 and 1995 the number of events that caused damage costs of at least 10 Mio. USD per year is in the lower 2-digit range (25 or less occurrences), whereas those occurrences exhibit approximately a 4-20 fold increase from 1996 onwards!! So clearly, the number of extreme weather events with major financial burden significantly increased over a period of only 70 years having a Pearson’s correlation coefficient of 0.85!!

tail(subset(df_events_victims, select = c("YEAR", "MONTH")),1)
##         YEAR MONTH
## 1442330 2017     4

One major issue when analysing monetary costs, however, is inflation! Due to inflation, the inherent monetary value for damage costs is actually different for each year. E.g., taking 2017 as reference, a virtual shopping cart valued with $100 cost substantially less in the year 1900, i.e. due to inflation, there is monetary loss in value over time. Since We want to compare events with similar monetary value, we have to normalize the data and correct for inflation. I used historical consumer price index data (https://inflationdata.com/Inflation/Consumer_Price_Index/HistoricalCPI.aspx?reloaded=true) for normalization of damage costs.

# load the US consumer price index data
cpi_index = read.csv("CPI_usa.csv")
# calculate cpi for 2017 based on data from jan to jul; the main data set contains data only until 2016; values for 2017 are obtained from www.inflationdata.com
cpi_index$Price[1] = mean(c(242.839,    243.603,    243.801,    244.524,    244.733,    244.955,    244.786))

# use hash function to store CPI for each year
hash_cpi = hash(cpi_index$Year, cpi_index$Price)
# function to return CPI for the corresponding year (used with apply)
return_cpi_value = function(hash_key){
  return(hash_cpi[[as.character(hash_key)]])
}

# insert cpi in the main dataset, because element-wise returning of hashed cpi values not directly possible using dplyr's transform function (not sure why?) 
# --> transform(DAMAGE_PROPERTY_inflation_clean = (hash_cpi[["2017"]]/hash_cpi[["as.character(YEAR)]])*DAMAGE_PROPERTY ) will return an ERROR
df_events_victims["CPI"] = apply(subset(df_events_victims, select=c("YEAR")), 1, function(x) return_cpi_value(x))

# use the cpi value to correct for inflation
df_events_victims = df_events_victims %>%
  transform(DAMAGE_PROPERTY_inflation_clean = (hash_cpi[["2017"]]/CPI)*DAMAGE_PROPERTY ) %>%
  transform(DAMAGE_CROPS_inflation_clean = (hash_cpi[["2017"]]/CPI)*DAMAGE_CROPS )

# combine inflation cleaned variant of property (DAMAGE_PROPERTY_inflation_clean) and crops damage (DAMAGE_CROPS_inflation_clean) to calculate the total damage for each event
df_total_damage_inflation_cleaned = df_events_victims %>% 
  mutate_at(vars(DAMAGE_CROPS_inflation_clean), funs(ifelse(!is.na(DAMAGE_PROPERTY_inflation_clean) & (is.na(DAMAGE_CROPS_inflation_clean) | DAMAGE_CROPS_inflation_clean==0),0,
  DAMAGE_CROPS_inflation_clean ))) %>%
  mutate_at(vars(DAMAGE_PROPERTY_inflation_clean), funs(ifelse(!is.na(DAMAGE_CROPS_inflation_clean) & (is.na(DAMAGE_PROPERTY_inflation_clean) | DAMAGE_PROPERTY_inflation_clean==0),0,
  DAMAGE_PROPERTY_inflation_clean ))) %>%
  transform(damage_combined = DAMAGE_PROPERTY_inflation_clean + DAMAGE_CROPS_inflation_clean) %>%
  filter(!is.na(damage_combined)) 

# returns the same variables as in timeseries_events, however with inflation correction
timeseries_events_inflation_cleaned = subset(df_total_damage_inflation_cleaned, select = c("YEAR", "EVENT_CATEGORY", "damage_combined")) %>% 
  group_by(YEAR) %>%
  summarise(counts = n(), count_extreme_normalized = sum(damage_combined>=50000000))
p15 = ggplot(aes(x = YEAR, y = counts, group=1), data = timeseries_events_inflation_cleaned) + geom_line()  +
  scale_y_continuous(lim=c(0,100000), breaks=seq(0,100000,20000))+
  scale_x_continuous(breaks=seq(1950,2018,4))+
  xlab("Year")+
  geom_point(aes(size = count_extreme_normalized), alpha = 1/2, shape =21, fill = I('#F79420')) +
  scale_size(breaks=c(0,25,50,75,100,200,500,1000,1500),labels=c(0,25,50,75,100,200,500,1000,1500), range=c(0,15)) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size =10), axis.text.y = element_blank(),
        axis.ticks.y = element_blank(), axis.title.y = element_blank())+
  annotate(size = 4, hjust =0, "text", x = 1950, y = 90000, label = paste( "Correlation Coefficient (Year - log10($50 Mio. Damage Events)): ",
  round((with(timeseries_events_inflation_cleaned, cor.test(YEAR, log10(count_extreme_normalized+1))))$estimate,2) ))+
  guides(size=F)


plot_grid(p14,p15, ncol=2, rel_widths=c(7,5), labels = c("Damage Costs without Inflation Correction", "Damage Costs with Inflation Correction"))

Even after normalizing damage costs on monetary value, the data implies that the number of events incurring over $50 Mio. damage costs is increasing over time. With a value of 0.78, the Pearson’s correlation coefficient is still in support of a strong correlation between Year and the number of events with at least 50 Mio. USD Damage (log10 scale). However, one factor necessary to be accounted for is the expenditure by the construction sector over the entire period under investigation. Infrastructure investments are made on annual basis and as such we can expect an increase in “average value per square metre”. This is very important, because increase in damage costs can be attributed to higher frequency of extreme weather incidents or alternatively to increased property value without changes in event frequency. It is quite difficult to normalize for infrastructure investments, because additionally it would require expanding the normalization for density of such construction work, i.e., where did the investments take place (entire United States versus particular areas), what kind of investments (few but very costly investments versus many but less costly investments). In summary, it can be anticipated that anthropogenic changes in landscapes (due to on going construction) increases the likelihood of extreme weather disasters.

Let’s go into more detail and analyse the distribution of extreme weather occurrences and the aforementioned severe variants (damage costs >= 50 Mio. USD) for each weather category. For this purpose Inflation cleaned data will be used.

# returns total number of events as well as the number of "high-damage" events (defined as events with damage >= $50 Mio. USD) for periods of 5 years (starting from 1950)
# df_total_damage_inflation_cleaned is the version of the main data frame df_events_victims, however, with inflation cleaned  DAMAGE_PROPERTY and DAMAGE_CROPS combined to total_damage  
events_by_decade_category = subset(df_total_damage_inflation_cleaned, select = c("decades_buckets", "EVENT_CATEGORY", "damage_combined")) %>% 
  group_by(decades_buckets, EVENT_CATEGORY) %>%
  summarise(counts = n(), damage_total = sum(damage_combined), count_extreme = sum(damage_combined>=50000000)) %>%
  # events for the years 2016 and 2017 have been omitted for consistency (all 5 year bins). those will be represented as NA in decades_buckets and, thus, should be removed
  filter(!is.na(decades_buckets))  %>%
  ungroup() 
ggplot(aes(x = decades_buckets, y = counts, group=EVENT_CATEGORY), data = events_by_decade_category) + geom_line()  +
  ggtitle("Time series of the total number of extreme weather events and\n high-damage events for each category") +
  xlab("Period in Years")+
  ylab("Number of Events")+
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size =10), axis.text.y = element_text(size =10), axis.title = element_text(size =12), plot.title= element_text(hjust=0.5, size=14, face = "bold"), legend.text=element_text(size=10), legend.title=element_text(size=12), strip.text = element_text(size=12)) +
  scale_x_discrete(labels=c("1950-1955", "1955-1960", "1960-1965", "1965-1970", "1970-1975", "1975-1980", "1980-1985", "1985-1990", "1990-1995", "1995-2000", "2000-2005", "2005-2010", "2010-2015")) +
  scale_y_log10(breaks=c(1,10,100,1000,10000, 100000)) +
  facet_wrap(~EVENT_CATEGORY) +
  geom_point(aes(size = count_extreme), alpha = 1/2, shape =21, fill = I('#F79420')) +
  scale_size( "Number of Events with\nat least $50 Mio. Damage", breaks=c(0,25,50,75,100,200,400,800,1200),labels=c(0,25,50,75,100,200,400,800,1200),range = c(0,12))

All weather categories exhibit an increase in occurrence, however, only Hail, Storm and Tornado category, respectively, have continuous records from 1950-2017. It is conspicuous that for all remaining categories the first records start in 1995. This indicates that the data set is incomplete. Also, most of the categories show an increase in events causing a total damage of at least 50 Mio. USD over the period from 1950-2017. As depicted in figure Time series of the number of total extreme weather events and high-damage ($50 Mio. and more) events, 2005 shows above average activity in terms of extreme costly events with a total of 1000 or more occurrences. It is evident from the figure shown above that the majority of these events belong to the storm category. Interestingly, this is consistent with reports on 2005 being the most active hurricane season on record (http://www.intellicast.com/storm/summary/)

I want to take a yet another approach on reducing the bias in analysing extreme weather events. Similar to damage costs, I assume that the most severe Tornadoes -F4 and F5 category- are less prone to detection and report bias and are likely a less biased proxy to assess the occurrence of extreme weather events over time.

table(df_events_victims$TOR_F_SCALE)
## 
##             EF0     EF1     EF2     EF3     EF4     EF5     EFU      F0 
## 1376146    7520    4716    1475     624     362     318      47   20773 
##      F1      F2      F3      F4      F5 
##   16931    9043    3085    1088     202

The names designated to describe tornado categories are not consistent. The categories are a mixture of the “Fujita scale” (E.g., “F5”) and “Enhanced Fujita Scale” (E.g., “EF5”), which can be used interchangeable (http://www.tornadofacts.net/tornado-scale.php). Before continuing with the analysis, they will be transformed to “Fujita Scale”.

# transform all Tornado categories to Fujita scale by using regex to ommit the E from EF classification
consistent_tornado_scale = function(entry){
  # takes TOR_F_SCALE value and makes transformation
  return(gsub("(\\w)(\\w[0-9]$)","\\2",entry))
}

tornado_occurences_timeseries = subset(df_events_victims, select=c("STATE", "STATE_FIPS", "YEAR", "TOR_F_SCALE"))  %>%
  # apply Tornado category transformation 
  mutate_at(vars(TOR_F_SCALE), funs(consistent_tornado_scale(.))) %>%
  # group dataset by year and Tornado category
  group_by(YEAR, TOR_F_SCALE) %>%
  summarise(count = n())%>%
  # There are also empty strings ("") as TOR_F_SCALE values; transform "" to NA and remove by filtering
  mutate_at(vars(TOR_F_SCALE), funs(ifelse(.=="", NA,TOR_F_SCALE)) ) %>%
  filter(!is.na(TOR_F_SCALE)) %>%
  ungroup()

# cast dataset into wide format (each category as a column vector with corresponding path values)
tornado_occurences_timeseries_wide = dcast(subset(tornado_occurences_timeseries, TOR_F_SCALE != "EFU"), YEAR ~TOR_F_SCALE)

# calculate correlation coefficient between Year var. and each Tornado category var.
cor_coeff_occurences_year_by_category = round(apply( subset(tornado_occurences_timeseries_wide, select=c("F0", "F1", "F2", "F3", "F4", "F5")), 2, function(x) (cor.test(tornado_occurences_timeseries_wide$YEAR, log10(x))$estimate) ),2)
ggplot(aes(x=YEAR, y=count ), data = subset(tornado_occurences_timeseries, !(TOR_F_SCALE=="EFU")) ) + 
  geom_point() + 
  ggtitle("Annual Occurence of Tornadoes (by Category)") +
  xlab("Year")+
  geom_smooth(method = "lm")+ 
  scale_x_continuous(breaks=seq(1950,2018,5))+
  scale_y_log10(limit=c(1,10000), breaks=c(1,10,100,1000,10000))+
  ylab("Number of Tornadoes")+
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10), axis.text.y = element_text(size =10), axis.title =element_text(size =12), 
        plot.title= element_text(hjust=0.5, size = 14, face = "bold"), strip.text = element_text(size=12)) +
  facet_wrap(~TOR_F_SCALE) +
  annotate(size=4.0, hjust =0, "text", x = 1950, y = 5000, label =  paste("Correlation Coefficient: ", cor_coeff_occurences_year_by_category ) )

Using a linear model to depict the trend in changes of the annual number of Tornados as a function of time, we can see that Tornadoes of category F0 exhibit the strongest increase of annual occurrences over time. Similarly, F1 category events show an time-dependent increase of annual occurrences. F3 and F4 Tornadoes, however, exhibit no significant change in annual occurrences over time. Strikingly, the estimated trend for the most severe category, F5, is also positive, exhibiting a more than 2-fold increase in annual occurrences between 1950 and 2017. If we take into account that dataset for the year 2017 is incomplete and based on the analysis of seasonal distribution, the major part of the Tornado season is not included, one could speculate that the positive trend is even more pronounced. As already noted, this trend might be reflected by increasing temperature in the Gulf of Mexico (http://onlinelibrary.wiley.com/doi/10.1002/2016GL071603/full).

In the context of the mentioned report that is addressing the temperature effect of the Gulf of Mexico on Tornado frequency, I would like to test one last hypothesis. Tornadoes mostly originate from storms, which in turn are “fuelled” with energy by warm moist air (http://www.nssl.noaa.gov/education/svrwx101/thunderstorms/, http://onlinelibrary.wiley.com/doi/10.1002/2016GL071603/full). With increasing longevity of Storms, one can anticipate that not only Tornado formation is more likely to occur, but also Tornadoes with increased lifespan. Consequently, I am curious, whether the data provides any evidence that the average travelled Tornado distance has increased over time due to increased surface temperatures that could provide more energy to Tornado formation and, thus, potentially increase the longevity. Again, I expect that this approach is less prone to detection and report bias, albeit the calculated average distance is more robust with increasing number of occurrences. I will use the Harvesine formula (https://en.wikipedia.org/wiki/Haversine_formula) from geosphere package to calculate the distance between start and end coordinates of recorded events. Using the distm() function (harvestine formula) returned following error “Error in .pointsToMatrix(y) : longitude < -360”. It seems that the dataset contains longitude values outside the valid range, which is -180 to +180 for longitude and -90 to +90 for latitude (http://www.geomidpoint.com/latlon.html). We can see the problem by auditing for valid range.

tornado_path_distance = subset(df_events_victims, select=c("YEAR","TOR_F_SCALE","EVENT_CATEGORY","BEGIN_LON", "BEGIN_LAT", "END_LON", "END_LAT")) %>%
  filter(EVENT_CATEGORY == "Tornado") %>%
  filter(complete.cases(.)) 

# show BEGIN_LON entries with invalid values (> 180 vs. < -180)
head(subset(tornado_path_distance, tornado_path_distance$BEGIN_LON < -180),2)
##       YEAR TOR_F_SCALE EVENT_CATEGORY BEGIN_LON BEGIN_LAT END_LON END_LAT
## 17702 2000          F0        Tornado   -814.05     31.12 -814.05   31.12
## 17852 2000          F0        Tornado   -764.00     41.35  -76.63   41.38

There is a floating point problem in the longitude variable (problematic longitude values are solely < -180; no invalid range for latitude and no longitude values > +180). For some values the decimal point is obviously at the wrong place. Before I continue with calculating the distance I will correct the longitude variable by moving the decimal point to the next lower decimal place for all values with decimal point at third position.

correct_longitude = function(lon) {
  if(lon < -180){
    # takes BEGIN_LON, END_LON values (lon) and changes the decimal position if appropriate
    # longitude values need to be transformed to correct string representation in order for regex/gsub to work correctly. I assume that regex/gsub first converts to string before applying
    # regex check, because if the last two positions in the number are zeros, as in -701.00, no valid regex match will result and the gsub groups will be messed up. As such, transform
    # longitude value using formatC() to keep floating point precision:
    # --> 700.00 will be converted to "700.00" instead of "700" when using as.character() conversion
    lon = formatC( round( lon, 2 ), format='f', digits=2 )
    lon_correct = gsub("(-\\d{2})(\\d)(\\.)(\\d+)", "\\1\\3\\2\\4" , lon )
  }else{
    lon_correct = lon
  }
  return( as.numeric(lon_correct))
}

# vectorize function for proper usage with dplyr's mutate_at() function; if clause can't handle vectors, returning following error: "the condition has length > 1 and only the first element will be used"
correct_longitude = Vectorize(correct_longitude, USE.NAMES = F)

tornado_path_distance = tornado_path_distance %>%
  # you can also use ifelse() with mutate_at() function, however, at expense of good overview
  mutate_at(vars(BEGIN_LON, END_LON), funs(correct_longitude(.)))
 
# reclassify Tornado categories to Fujita scale (see prior section)
tornado_path_distance["TOR_F_SCALE"] = apply(subset(tornado_path_distance, select=c("TOR_F_SCALE")), 2,function(x) consistent_tornado_scale(x) )
 
# For each Tornado event, calculate the distance in m, based on longitute/latitude coordinates for 2 points
tornado_path_distance["distance"] = apply(subset(tornado_path_distance, select = c("BEGIN_LON", "BEGIN_LAT", "END_LON", "END_LAT")), 1, function(x) distm(c(x[1],x[2]), c(x[3],x[4]),fun = distHaversine ))

# Calculate the average distance for each year (no differentiation between categories)
tornado_path_distance_by_year = tornado_path_distance %>%
  group_by(YEAR) %>%
  summarise(avg_distance = mean(distance/1000))
  
# Calculate the average distance for each year and each category
tornado_path_distance_by_year_category = tornado_path_distance %>%
  group_by(YEAR, TOR_F_SCALE) %>%
  summarise(avg_distance = mean(distance/1000)) %>%
  mutate_at(vars(TOR_F_SCALE), funs(ifelse(.=="", NA,TOR_F_SCALE)) ) %>%
  filter(!is.na(TOR_F_SCALE)) %>%
  ungroup()
ggplot(aes(x=YEAR, y = avg_distance), data = tornado_path_distance_by_year) +
  geom_point()+geom_smooth(method="lm")+
  scale_x_continuous(breaks=seq(1950,2018,5))+
  ggtitle("Average Tornado Travel Distance") +
  xlab("Year")+
  ylab("Distance (km)")+
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10), axis.text.y = element_text(size =10), axis.title =element_text(size =12),
        plot.title= element_text(hjust=0.5, size = 14, face = "bold")) +
  annotate(size=4, hjust =0, "text", x = 1950, y = 5, label =  paste("Correlation Coefficient: ", round(with(tornado_path_distance_by_year, cor.test(YEAR, avg_distance))$estimate,2) ) )

I was quite surprised because the results contrasted what I expected. Especially the effect size. The average path distance for a Tornado has not increased over time but actually the data provides compelling evidence for a strong negative correlation with time. At first I wasn’t really sure if this result had some inherent bias of which I was not aware of.

However, it occurred to me that if there is not a significant change in path distance over time and if lower category Tornados have a shorter path distance on average then this might confound the “true” trend for path distance over time, as the number of lower category Tornados is increasing at highest rate over time and also at later period low category Tornadoes are by the most abundant ones. Consequently, analysing the path distance over time separately for each Tornado category should reduce this bias.

# cast dataset into wide format (each category as a column vector with corresponding path values)
tornado_path_distance_by_year_category_wide = dcast( subset(tornado_path_distance_by_year_category, TOR_F_SCALE != "EFU"), YEAR ~TOR_F_SCALE)

# calculate correlation coefficient between Year var. and each Tornado category var.
cor_coeff_dist_year_by_category = round(apply( subset(tornado_path_distance_by_year_category_wide, select=c("F0", "F1", "F2", "F3", "F4", "F5")), 2, function(x) (cor.test(tornado_path_distance_by_year_category_wide$YEAR, x)$estimate) ),2)
# exclude EFU category, which probably stands for "unknown"
ggplot(aes(x=YEAR, y = avg_distance), data = subset(tornado_path_distance_by_year_category, TOR_F_SCALE != "EFU") ) +
  geom_point()+geom_smooth(method="lm")+
  scale_x_continuous(breaks=seq(1950,2018,5))+
  ggtitle("Average Tornado Travel Distance (by category)") +
  xlab("Year")+
  ylab("Distance (km)")+
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10), axis.text.y = element_text(size =10), axis.title =element_text(size =12), 
        plot.title= element_text(hjust=0.5, size = 14, face = "bold"), strip.text = element_text(size=12)) +
  facet_wrap(~TOR_F_SCALE)+
  annotate(size=4.0, hjust =0, "text", x = 1960, y = 70, label =  paste("Correlation Coefficient: ", cor_coeff_dist_year_by_category ) )

Faceting by Tornado category shows that the negative correlation between path distance and year is not merely biased by averaging over all categories, as mentioned before. However, the plots reveal that reduced path distance over time is predominantly true for low category Tornadoes (F0-F3), whereas more severe categories exhibit weak or no significant correlation. So how can we interpret the results? Are they reflecting an effect of anthropogenic landscape changes (construction work) on Tornado movement, i.e. is there any “energy-reducing or stopping” effect exerted by buildings? On the other hand, are the results nevertheless reflecting inherent detection/report bias, i.e. a more reliable and accurate estimation of Tornado paths with the advent of Doppler radar, implying an overestimation of travel distance before? I was not able yet to find conclusive reports/research on either of the hypotheses.

Final Plots and Summary

In the following section I will highlight some major findings.

Plot one

knitr::include_graphics('Temperature_change.gif')

The aim of this analysis is to survey the potential effect of climate change on extreme weather occurrences. The terms climate change and global warming are often used interchangeably. Therefore, let’s first start with exploring the temperature trend for the 20th century. We can see the same trend at different scales -US County scale, US State scale, Global scale-. Annual temperatures are below the 20th century temperature average for the first half of the 20th century, however already exhibiting higher annual temperatures every year. For the second half of the 20th century, the annual temperatures surpass the 20th century average. Furthermore, at higher rate than the temperature increase for the first part of the 20th century, especially if we extrapolate temperature data for the years 2014-2016 (https://www.ncdc.noaa.gov/sotc/global/201613). This is even more highlighted by the number of average temperature records for the 20th century, i.e. a temperature higher than all measured temperatures before (red and blue bars). We can see a discrepancy between the first and second half of the 20th century. Clearly, the frequency between temperature records is decreasing over time. Again, bear in mind that 2014-2016 are all years with new temperature records (https://www.ncdc.noaa.gov/sotc/global/201613). As such, the analysis of historical and current surface temperature data is clearly supportive of global warming.
Integrate data on global Carbon dioxide concentration and - emission, is furthermore supportive of anthropogenic influence on global warming. Starting with the onset of industrial revolution, CO2 emission rates continuously increased, with extremely high rates from 1950-1955, continuously increasing CO2 concentration in the atmosphere. The true dimension of anthropogenic CO2 addition becomes really apparent if we put the 20th century increase in CO2 concentration in a historical context ranging 400.000 years back, showing atmospheric CO2 concentration fluctuating between 180 and 300 ppm (https://climate.nasa.gov/vital-signs/carbon-dioxide/)!! So the current CO2 record is clearly caused by human activity, enhancing the greenhouse effect, which is responsible for surface warming. Of course, we can’t imply causation between CO2 emission and global warming. Nevertheless it is intriguing that the second and much more pronounced warming period in the 20th century is correlating with the anthropogenic CO2 increase!

Plot two

knitr::include_graphics('Extreme_Weather_timeseries_All combined.gif')

This plot shows the number of combined (all categories) extreme weather events in the United States at US County scale for different time periods and the dynamics of frequency (number of events per 5-year period) change over a 70 years period. Overall, the time series analysis shows that the frequency of extreme weather occurrences is continuously increasing over time. Moreover, the likelihood of encountering extreme weather seems to continuously increase for US cities with 500.000 or more population (red dots), which include the majority of US population, as those cities belong to areas of highest event frequency (80 or more events). So we can anticipate high impact of extreme weather on the major portion of US population.

Plot three

plot_grid(p14,p15, ncol=2, rel_widths=c(7,5), labels = c("Damage Costs without Inflation Correction", "Damage Costs with Inflation Correction"))

First, we have to keep in mind that the mere number of events and the change in event frequency, respectively, is likely biased. With growing population, the number of so called storm spotter (trained volunteers) is increasing. Furthermore, with the onset of the second half of the 20th century, Doppler radar technology was implemented in metrological surveillance and this technology was continuously refined since then (https://en.wikipedia.org/wiki/Weather_radar). Based on this progression, it is very reasonable to assume that the overall capacity and capability of detecting extreme weather events is increasing over time. As such, the annual number of events is likely overestimated due to advent in technology and more manpower. I assume, however, that the mentioned bias is most pronounced in less severe events and less influential with severe events, respectively. On the one hand, severe events wouldn’t necessarily require sophisticated technology for being detected. On the other hand, because of greater impact on ecology, economy and society, I expect a much higher willingness for extensive reporting. Consequently, I expanded the plots showing the number of events as a function of time by the annual number of “most severe events”" (arbitrary definition; here denoted as events causing 50 Mio. USD or more damage). We can clearly see that there is an on going increase of such events over time (left figure). Especially from 1992, the annual number of such events is significantly increased. The drop in 2017 is explained in the corresponding section. Inflation can be a pitfall when comparing monetary values from different years. In order to improve the analysis I normalized the damage values to historic customer price index, which corrects for inflation. The results are depicted in the right figure. After correcting for inflation the data still suggests an increase in extreme weather events that have caused damage of at lest 50 Mi. USD. A reduced correlation coefficient of 0.78 is still representing a strong correlation between the number of extreme weather and time (represented as year).

Plot four

plot_grid(p9,p10,p11, align = "hv", ncol = 3)

This panel of plots is very interesting because it highlights regions with highest risk for extreme weather (right plot). We can see that especially the central and eastern part of the United States is confronted with high extreme weather activity. Moreover, plotting the total number of casualties (left plot) and damage incurred (right plot) for each state, very well reflects those regions with highest extreme weather probability. So extreme weather is a serious threat, causing devastating damage and claiming lives of many people. Decision-makers should at least make use of such data (analysis) to evaluate regional risk for extreme weather and make appropriate precautions to minimize the consequences. This is especially controversial in context of recent hurricane Harvey that hit Texas and is claimed to be one of the costliest storms in US history with the estimated damage being as high as $100 Billion (https://www.theguardian.com/us-news/2017/aug/29/total-harvey-cost-insurance-texas-tropical-storm-hurricane-sandy) and only 40% the total damage on private individuals are estimated to be covered by insurance (https://www.nytimes.com/2017/08/28/business/dealbook/flood-insurance-harvey.html?mcubz=3)!

Reflection

Thoughts on technical implementation

First, I want to reflect on technical issues with respect to this analysis. Having more practise in Python so far, I decided to write Python scripts in order to scrape the necessary datasets. Relevant scripts were used to obtain following data:

  1. Data on extreme weather from the National Climatic Data Centre archive (https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/)
  2. Data on US and global temperatures from KAGGLE (www.kaggle.com)
  3. Data on global carbon dioxide emission from the Carbon Dioxide Information Analysis Centre (http://cdiac.ornl.gov/CO2_Emission/timeseries/global)
  4. US census data from www.populationpyramid.net
  5. Consumer Price Index data from www.inflationdata.com

Using Python modules “requests” and “BeautifulSoup” makes acquisition of data in HTML format fairly straightforward. The browser developer tool was used to determine how data is stored within the HTML document and based on this, how to retrieve the data via BeautifulSoup parsing. However, I experienced some struggle with the US census data acquisition. The relevant data -percentage of each population group- was embedded as a graphic within a svg element. Just as expected, the entire data structure was accessible via the browser developer tool but I was not able to parse the data using via BeautifulSoup’s standard methods findAll(“element of interest”) or find(“element of interest”), irrespective of the HTML parser specified with BeautifulSoup (“html”, “xml”, “lxml”, “html5lib”). I parsed the parental element of the svg element to get an overview about the retrieved html document and to pinpoint the problem. Apparently text contents of variables inside the svg element were embedded as variables themselves and not showing the actual text. E.g., class =“femalePercentage”3.0% was represented as femalePercentageLabels=bar.append(‘text’) inside the retrieved html document and not as expected from the browser developer tool screening. I still haven’t figured why the retrieved html document is different from the html data structure returned by the browser developer tool but I assume that the graphical representation of data interferes with parsing. I was very relieved to find the relevant data -not the percentages but the actual numbers- also represented in JSON format inside the script tag within the HTML document, which I was able to parse using BeautifulSoup. Nonetheless, the second approach also required some troubleshooting!! In order to access the data, calculate the appropriate population percentages for each age group, I had to convert the JSON object into a python dictionary, which is usually also very straightforward by using Python’s “json” module. Usually is the keyword!! Using the json.loads() method for JSON to dictionary conversion returned following error: “ValueError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)”. After some research, I found some help on stackoverflow. So apparently, json.loads() is not able to handle strings that have internal single quotes and external (the actual quotes denoting the string) double quotes. With this new information the error message now makes much more sense to me. In order to circumvent this problem I used the function ast.literal_eval(), which simply evaluates the input string and return the corresponding Python literal structure encoded inside the string (here a dictionary). Alternatively, I could have written code to substitute double to single quotes. These two challenges really took me a while and required quite some detective work.

Another problem or rather unexpected behaviour that I have experienced was with R’s hash library. Python’s dictionaries really offer great functionality and I was really excited to find R’s hash library that essentially provides very similar functionality. Using the hash() function, I wasn’t able to use “h” as a key in the standard notation hash(key=value). Following error was returned: “(The empty character string, ’’, cannot be used for a key at key(s): 1)”. However, I was able to use “h” as a key with the .set() notation .set(hash(), keys=c(), values=c()). I wasn’t able to figure out if this is a bug or actually a feature.

Apart from the mentioned challenges, I wasn’t really confronted with any major technical issues. In fact, being rather affiliated with Python including pandas, numpy and matplotlib library for (explorative) data analysis, I was extremely surprised how much functionality actually R’s dplyr, tidyr and reshape provide. This analysis required a lot of data rearrangement and reshaping. So I was really astonished how effortless this can be accomplished with the mentioned R libraries, especially using the %>% notation to concatenate many different steps. This highly reminded me of the SQL syntax. Even more so, I really learned to value the power and ability of ggplot in combination with cowplot and gridExtra to make any custom plot. Plotting in Python using matplotlib can really be very cumbersome if you want to make custom changes or add custom features to plots. Again, the concatenating functionality in ggplot (+ parameter) to add different layers, really made it easy to construct any desired custom plots. The only time I was frustrated by the functionality of ggplot/gridExtra was in context with making a multiplot with three figures and one joint figure legend. Having one legend constructed with one plot but serving as a joint legend in the multiplot, messed up the alignment of the plots. This took me some research on stackoverflow to figure out how to convert the figure legend to a plot object and simply include it in the multiplot as an individual figure, which would preserve the correct formatting. However, in the progress of the project I came across the cowplot package, which essentially provides exactly the mentioned formatting functionality. So overall, I am really convinced by the superiority of the mentioned packages for exploratory data analysis!!

Thoughts on Data

So what’s the conclusion? Is global warming responsible for increasing numbers of extreme weather events? The short answer is we can’t tell for sure! First things first, the data provides unequivocal evidence for global warming. We are experiencing an constant increase in average temperature over the past 100 years and even more intriguing, the data suggests humans to be a major driver for this trend by burning fossil fuels. Similarly, the data implies an increase in extreme weather events over the same period. So one might be tempted to infer direct correlation between temperature increase/global warming and the number of extreme weather events. However, from a scientific point of view, there are very likely many confounding variables that prevent us from drawing clear conclusions. Ironically and as already noted, the major difficulty in deducing unbiased correlations between temperature and the number of events is the constant improvement in technology. In each decade new technology has led to more sophisticated metrological equipment, which has led to increased sensitivity in measuring and detecting extreme weather, which has led to reporting more events. So essentially it is highly biased to compare numbers from earlier periods of the 20th century to numbers from the later period. So what can we do to obtain more reliable results? What I already tried was to look at the data from a different “perspective”, assuming less bias. E.g., I plotted the number of very high-damage events (arbitrary definition) over time, thus, ignoring “ordinary” events, which I believe are those mostly reflecting the detection capacity of each decades’ technology. Nonetheless, this approach still mostly reflects “gut feeling” and is certainly not scientifically valid (I have no data supporting the hypothesis of less detection bias with more severe environmental phenomena). Future work would most certainly benefit from more data on the detection procedure itself. E.g., having the number of detection units for different locations and years. Is the number of such units per location/year changing? Having this information, normalizing observations per unit, could definitely help to reduce detection bias. Going a step further, one could think of wrapping measurements, such as temperature, carbon dioxide emission, -concentration and number/category of extreme weather events, in an “experimental setup”“, i.e. using contemporary technology and a defined number of detection units in selected locations and conducting long-term measurements without changing the setup. This would produce comparable data that could be subjected to statistical analysis allowing most accurate interpretations of the results.

Finally, I want to reflect on surprising results. In the section “Can we analyse the data at reduced bias in reporting severe weather events”, I already elaborated on the surprising finding that the average Tornado path distance decreased markedly over time. Please refer to the corresponding section for more details. Also not expected are the results on the relative risk for different age groups to suffer consequences from extreme weather (either injuries or fatalities). Why is the relative risk decreased for age groups “up to 19” and “20-39”? For the former group I expect that the risk reduction is reflected by the “children first” rescue code, denoting that rescuing the lives of children has priority. However, the reduced risk for the latter group is really odd in my opinion. I can’t think of a logical explanation, since this peer group represents prime age adults (best physical condition and health on average), which I would expect to reflect population numbers at best. Based on statistics from the National Fire Protection Association (http://www.nfpa.org/) most fire fighters belong to the age group “20-39” and I assume similar statistics for other rescue professionals. As such I would rather expect higher risk for this age group! Maybe the age group “20-39” distribution is highly right skewed with most entries for individuals of age 20-25, which still might qualify as children for which I assume the “rescue children first policy”. The extreme weather events dataset contains age information at year-scale, so it makes sense to have a look at the actual distributions for each age groups in follow up studies. Age group “40-59” risk rate is fairly close to expected values and there is likely some inherent fluctuation depending on the population census data estimates. For all other age groups I assume that on average lower physical and health conditions of elderly people puts those groups at greater risk. In this context it would be interesting to compare direct versus indirect casualties. I would assume high numbers of indirect casualties for older age groups (because of physical conditions). This data is included in the dataset and the report will be updated according to a follow up study.

Sources used for this project

https://climate.nasa.gov/evidence/
http://www.bbc.com/future/story/20130410-the-greatest-show-on-earth
https://oceanservice.noaa.gov/education/pd/tidescurrents/tidescurrents_effects_influences.html
http://www.nssl.noaa.gov/education/svrwx101/
http://www.populationpyramid.net/united-states-of-america/
https://en.wikipedia.org/wiki/Tornado_Alley
https://en.wikipedia.org/wiki/Haversine_formula
https://www.ncdc.noaa.gov/temp-and-precip/state-temps/
http://cdiac.ornl.gov/CO2_Emission/timeseries/global
https://www.ncdc.noaa.gov/sotc/global/201613
http://onlinelibrary.wiley.com/doi/10.1002/2016GL071603/full
https://www.nature.com/articles/nature14491
http://www.iflscience.com/environment/atlantic-entering-cool-phase-will-change-world-s-weather/
http://www.iflscience.com/environment/atlantic-entering-cool-phase-will-change-world-s-weather/
http://stateoftheocean.osmc.noaa.gov/atm/amo.php
https://inflationdata.com/Inflation/Consumer_Price_Index/HistoricalCPI.aspx?reloaded=true
http://www.intellicast.com/storm/summary/
http://www.nssl.noaa.gov/education/svrwx101/thunderstorms/
http://www.geomidpoint.com/latlon.html
https://www.theguardian.com/us-news/2017/aug/29/total-harvey-cost-insurance-texas-tropical-storm-hurricane-sandy
https://www.nytimes.com/2017/08/28/business/dealbook/flood-insurance-harvey.html?mcubz=3
https://www1.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/
https://www.kaggle.com
http://www.nfpa.org/